Abstract


Real-world planning problems often feature multiple sources of uncertainty,
including randomness in outcomes, the presence of adversarial agents,
and lack of complete knowledge of the world state. This thesis describes
algorithms for four related formal models that can address multiple types of
uncertainty: Markov decision processes, MDPs with adversarial costs,
extensive-form games, and a new class of games that includes both extensive-form
games and MDPs as special cases.

Markov  decision  processes  can  represent  problems  where  actions  have 
stochastic  outcomes.  We  describe  several  new  algorithms  for  MDPs,  and  then 
show  how  MDPs  can  be  generalized  to  model  the  presence  of  an  adversary 
that  has  some  control  over  costs.  Extensive-form  games  can  model  games  with 
random  events  and  partial  observability.  In  the  zero-sum  perfect-recall  case, 
a  minimax  solution  can  be  found  in  time  polynomial  in  the  size  of  the  game 
tree.  However,  the  game  tree  must  “remember”  all  past  actions  and  random 
outcomes,  and  so  the  size  of  the  game  tree  grows  exponentially  in  the  length 
of  the  game.  This  thesis  introduces  a  new  generalization  of  extensive-form 
games  that  relaxes  this  need  to  remember  all  past  actions  exactly,  producing 
exponentially  smaller  representations  for  interesting  problems.  Further,  this 
formulation  unifies  extensive-form  games  with  MDP  planning. 

We present a new class of fast anytime algorithms for the off-line computation
of minimax equilibria in both traditional and generalized extensive-form
games.  Experimental  results  demonstrate  their  effectiveness  on  an  adversarial 
MDP  problem  and  on  a  large  abstracted  poker  game.  We  also  present  a  new 
algorithm  for  playing  repeated  extensive-form  games  that  can  be  used  when 
only  the  total  payoff  of  the  game  is  observed  on  each  round. 
Chapter 1
Introduction 


The goal of this thesis is to design powerful modeling frameworks and general-purpose,
efficient algorithms for reasoning about uncertain environments. We draw upon techniques
from the theory of Markov decision processes (MDPs) and partially observable MDPs
(POMDPs), reinforcement learning, online learning (experts and bandit algorithms), and
game theory in pursuit of this goal. This section introduces the major topics and results of
the thesis, and broadly places the work in the context of other research. Detailed
discussions of related work as well as most citations will be deferred to the relevant chapters.
Figure 1.1 summarizes the principal problem models we will consider.

Our investigation into planning takes root in a fertile body of previous work, for
planning problems have inspired a rich tradition in both AI and operations research. Simple
problems like the shortest path problem gave way to the broad field of (deterministic) AI
planning, which includes general-purpose search algorithms like A* as well as algorithms
for specialized but higher dimensional STRIPS-style problems [Russell and Norvig, 2003,
Blum and Furst, 1997, Weld, 1999]. Researchers realized early on that purely
deterministic planning would not be sufficient for many problems of real-world interest. Markov
decision processes (MDPs) were one of the earliest formulations of planning problems
with uncertainty. Books by Bellman [1957] and Howard [1960] brought greater exposure
to the framework, and MDPs continue to make regular appearances in the AI and planning
literature.


Markov decision processes. MDPs can be used to represent a wide array of planning
problems. Generally, they model uncertainty by describing the effect of a particular action
as a distribution over possible outcomes, rather than assuming a single deterministic
outcome. We refer to this type of uncertainty as outcome uncertainty (sometimes also called
action uncertainty). In Chapter 2, we describe the MDP model more thoroughly, and
discuss several new algorithms that offer advantages over previously known techniques.
These algorithms are particularly effective for solving problems that have relatively few
stochastic states and where only a small fraction of the state-space is relevant. Both of
these are common characteristics of real-world problems.

Figure 1.1: Relationships between problem models considered in this thesis. Arrows point
to more general models; MDPs with adversarial costs and Convex EFGs are introduced in
this thesis.


Zero-sum game theory. The MDP model assumes that the uncertainty in the world is
stochastic, that is, that the uncertainty over outcomes is defined by some probability
distribution, even if that distribution is not initially known. This approach is valid in many
cases: stochastic models can effectively describe sensor noise, wheel-slippage on a mobile
robot, the arrival of jobs for a computing cluster; in fact, the list could go on for pages, as
modern research in AI is full of clever uses of stochastic modeling and probability theory.
But consider a game of chess: as we contemplate our move, we will have uncertainty
about what response our opponent will make. Accurately modeling this response with
a probability distribution would essentially require a complete model of our opponent’s
thought process, something that is not typically available! Instead, a more plausible
approach is to attempt to select a move so that no matter what our opponent does we win
(assuming such a move exists).
Game theory offers a variety of zero-sum models that capture this type of worst-case
reasoning.  The  term  zero-sum  implies  there  are  two  players  and  that  at  the  end  of  the  game 
the  payoff  to  one  player  is  the  negative  of  the  payoff  to  the  other  player — that  is,  the  payoffs 
sum  to  zero.  This  captures  a  purely  adversarial  model  of  interaction,  since  we  can  imagine 
the  payoff  as  a  value  that  one  player  has  to  pay  to  another  player.  Most  common  two-player 
games  played  by  people  are  in  fact  zero-sum.  However,  it  is  worth  emphasizing  that  the 
value  of  considering  zero-sum  games  comes  not  from  the  ability  to  model  parlor  games. 
Rather,  our  interest  is  in  using  the  tools  of  zero-sum  game  theory  to  reason  about  types  of 
uncertainty  where  stochastic  models  are  not  available.  In  particular,  we  can  reason  about 
the  worst-case  over  a  set  of  potential  eventualities,  rather  than  reasoning  in  expectation 
about  these  possibilities  given  a  fixed  probability  distribution. 
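
To make the contrast concrete, the following minimal sketch compares choosing an action by expected cost under an assumed distribution with choosing it by worst-case cost. The action names, event names, costs, and probabilities are all hypothetical and are not taken from the thesis; the sketch also considers only deterministic choices, whereas the game-theoretic models discussed here generally allow randomized strategies.

```python
# Hypothetical cost table: cost[action][event]. None of these numbers come from the thesis.
cost = {
    "safe":  {"calm": 4.0, "storm": 5.0},
    "risky": {"calm": 1.0, "storm": 9.0},
}


def expected_cost(action, prob):
    """Average cost of an action under a fixed distribution over events."""
    return sum(prob[event] * c for event, c in cost[action].items())


def worst_case_cost(action):
    """Cost of an action if an adversary is free to pick the event."""
    return max(cost[action].values())


prob = {"calm": 0.9, "storm": 0.1}  # assumed stochastic model of the events

best_in_expectation = min(cost, key=lambda a: expected_cost(a, prob))
best_in_worst_case = min(cost, key=worst_case_cost)
```

In this toy table the two criteria disagree: the expectation-based choice accepts a rare but expensive event, while the worst-case choice guards against it.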

This kind of strictly worst-case analysis can be overly pessimistic: in many situations
there are extremely unlikely events^ that may dramatically sway our plans if we assume
an adversary is free to force one of these events to happen. One of the advantages of the
models of uncertainty introduced in this thesis is that they provide a great deal of flexibility
to interpolate between stochastic and adversarial models of uncertainty. As a first example
of this approach, we consider MDPs where an opponent has some control over the costs
of actions taken by the player planning in the MDP.


MDPs with adversary-controlled costs. While MDPs can model outcome uncertainty,
they cannot directly model domains with only partially observable state, unknown
dynamics, or the presence of adversarial or cooperative agents. We describe several new
algorithms for planning in more general environments without sacrificing the relative
computational tractability of standard MDPs. These algorithms efficiently find plans that minimize
the expected total worst-case cost of reaching a goal state when an adversary may choose
any of a number of cost models. For example, we can find a solution to a stochastic
shortest path problem that minimizes the maximum expected cost over a set of different
edge-cost scenarios. A novel formulation as a linear program (LP) is sufficient to show
that these problems can be solved in polynomial time. However, experiments demonstrate
that directly solving the linear program is too slow for realistically sized problems, even
when using state-of-the-art commercial solvers. To address this, we present a
transformation that allows the use of any MDP solver as a subroutine, producing over an order
of magnitude speedup. We describe this model and its LP formulation in Chapter 3, and
present our faster algorithms in Chapter 5.

^Even  if  we  have  no  way  of  probabilistically  quantifying  “extremely  unlikely.” 
Extensive-form games. Partially observable stochastic games (POSGs) are the gold
standard for modeling uncertain planning problems. They can directly handle most types
of uncertainty: partial observability, noisy observations, random events, uncertain
outcomes, and other agents. It is extremely difficult to come up with a planning problem that
cannot be modeled in some way as a POSG; unfortunately, it is equally challenging to find
realistic problems where solving any POSG representation is computationally feasible.

For this reason, we do not consider fully general POSGs in this thesis work; however,
we will be quite concerned with the closely related class of extensive-form games (EFGs). EFGs
can model all of the types of uncertainty that can be modeled by POSGs, with an important
caveat: while POSGs can have a general state model where cycles are possible (states can
be revisited), states in an EFG are always structured in a directed tree, and so cycles are
not possible (no state is ever visited twice in the course of a game). Intuitively, a state in
an EFG corresponds to a complete history of past actions and events in a POSG. Thus, the
size of the game tree for an EFG can be exponential in the size of a POSG representation
of the same game.

Why constrain ourselves to a formulation that may entail an exponential penalty in
representation size? Because EFGs can be solved in polynomial time in the size of the
game tree (in the perfect-recall, zero-sum case; we fully introduce these restrictions in
Chapter 3). Rather than attack the provably hard problem of solving general POSGs, we
instead look to leverage the relative computational tractability of EFGs. We do this in two
ways: we generalize the model in a way that allows many interesting games to be modeled
exponentially more compactly than was previously possible, and we design fast algorithms
that solve both standard EFGs and our generalization very quickly.


Convex extensive-form games. Convex extensive-form games (CEFGs) generalize
EFGs by replacing the usual small set of discrete actions with arbitrary one-shot
two-player convex games; we fully describe the formulation in Chapter 4. Under some
reasonable assumptions, CEFGs can be solved as a single convex optimization problem
of polynomial size in the representation of the game tree. This model can represent
traditional extensive-form and matrix (normal form) games as well as MDPs and MDPs
with adversary-controlled costs. (In fact, an MDP with adversary-controlled costs can be
modeled as a single node in the game tree of a CEFG, as can an arbitrary matrix game.)
The central advantage of the CEFG framework is that it can provide exponentially smaller
representations of many interesting planning problems and sequential games. In addition
to developing the theory necessary for efficient computation with CEFGs, we motivate
the work by providing a high-level view of some interesting games that can be solved or
approximated using CEFGs.
Game Class                     Abbreviation   Reference
matrix (normal form) games     -              Section 3.1
convex games                   CG             Section 3.1
stochastic (Markov) games      SG             Section 3.5
extensive-form games           EFG            Section 3.2
convex stochastic games        CSG            Section 3.5
convex extensive-form games    CEFG           Section 4.1

Table 1.1: Game models considered.


Convex games. Matrix games, extensive-form games, convex extensive-form games,
MDPs, and MDPs with adversary-controlled costs are all instances of the deceptively
simple yet extremely expressive class of convex games. Though the concept of convex games
is not at all new [see Dresher and Karlin, 1953], hopefully this thesis will help highlight
theoretical and algorithmic properties of convex games that make the framework a
powerful tool for modern computer science. We will fully consider the class of convex games in
Chapter 3.

In addition to the examples listed above, we show that the normal-form stage games
in a stochastic game can be replaced with general convex games; these convex stochastic
games (CSGs) can be solved in the discounted case by minimax value iteration. By using
the convex game representation of an extensive-form game, we can embed EFGs as the
stage games of a CSG, creating a class of tractable partially observable stochastic games.
The key feature of this class is that the periods of partial observability are of bounded
duration. Table 1.1 summarizes the types of games considered in this thesis, along with
the abbreviations used and the section where the class is first discussed.


Algorithms for convex games. For convex games, it is typical to have fast best-response
oracles  that  find  an  optimal  response  to  a  fixed  opponent  strategy.  This  is  the  case  for 
EFGs,  where  a  linear-time  dynamic  programming  algorithm  can  calculate  a  best  response; 
and  for  MDPs  with  adversary-controlled  costs,  where  standard  MDP  algorithms  can  be 
used  to  calculate  a  best  response. 

In Chapter 5, we present new anytime algorithms for solving convex games that
leverage such fast best-response oracles. Our principal approach is to use oracles for both
players  to  build  a  model  of  the  overall  game  that  is  used  to  identify  search  directions;  the 
algorithm  then  does  an  exact  minimization  in  this  direction  via  a  specialized  line  search. 

We test our algorithms on both a simplified version of Texas Hold’em poker
represented as an extensive-form game, and a sensor-placement / observation-avoidance game
modeled as an MDP with adversary-controlled costs. For the poker game, our algorithm
approximated the exact value of this game within $0.30 (the maximum pot size is $310.00)
in a little over 2 hours, using less than 1.5GB of memory; finding a solution with
comparable bounds using a state-of-the-art interior-point linear programming algorithm took over 4
days and 25GB of memory. Our algorithms also demonstrate several orders of magnitude
better performance on the adversarial MDP problem.


The online problem. MDPs and extensive-form game models are useful when we have
at  least  a  partial  model  (either  adversarial  or  stochastic)  of  the  environment  in  which  we 
wish  to  plan.  However,  such  models  are  not  easily  available  in  many  cases  of  interest. 
For  problems  where  such  a  model  is  not  available,  no-regret  algorithms  can  provide  a 
reasonable  framework  for  decision  making. 

Chapter  6  presents  an  algorithm  for  a  general  online  (repeated)  decision  problem, 
where  on  each  timestep  a  strategy  from  a  convex  set  must  be  chosen  without  knowledge 
of  the  current  cost  (objective)  function.  Our  algorithm  guarantees  performance  almost 
as  good  as  the  best  fixed  solution  in  hindsight,  while  making  no  assumptions  about  the 
nature  of  the  costs  in  the  world  and  receiving  information  only  about  the  outcomes  of 
the  decisions  actually  made,  not  potential  alternatives.  Previous  results  for  this  problem 
were  limited  to  oblivious  adversaries  that  make  all  decisions  in  advance;  the  algorithm  we 
discuss  was  the  first  to  also  handle  the  case  of  an  adaptive  adversary. 

This algorithm can be applied to a wide range of problems; for example, it can be used
to play repeated convex games. Against a fully rational opponent, our no-regret algorithm
will asymptotically perform as well as the minimax strategy, and against an arbitrary
adversary the algorithm will do at least as well as the best fixed strategy in hindsight, which
can be much better than the minimax value of the game if the opponent is not, in fact, fully
adversarial.
Chapter 2

Algorithms  for  Planning  in  Markov 
Decision  Processes 

2.1  Introduction 


Markov decision processes (MDPs) provide a framework for planning in domains where
actions have uncertain outcomes. MDPs generalize deterministic planning problems like
the shortest path problem that can be solved with Dijkstra’s algorithm or A* as well as
the more structured deterministic problems tackled by AI planning [Cormen et al., 1990,
Russell and Norvig, 2003, Weld, 1999, Blum and Furst, 1997]. After briefly introducing
the MDP framework, we present two lines of research that lead to fast new algorithms for
solving MDPs.

In the first, we study the problem of computing the optimal value function for a Markov
decision process with positive costs. Computing this function quickly and accurately is a
basic step in many schemes for deciding how to act in stochastic environments. There are
efficient algorithms which compute value functions for special types of MDPs: for
deterministic MDPs with $S$ states and $A$ actions, Dijkstra’s algorithm runs in time $O(AS \log S)$.
And, in single-action MDPs (Markov chains), standard linear-algebraic algorithms find the
value function in time $O(S^3)$, or faster by taking advantage of sparsity or good
conditioning. Algorithms for solving general MDPs can take much longer: we are not aware of any
speed guarantees better than those for comparably sized linear programs. We present a
family of algorithms which reduce to Dijkstra’s algorithm when applied to deterministic
MDPs, and to standard techniques for solving linear equations when applied to Markov
chains. More importantly, we demonstrate experimentally that these algorithms perform
well when applied to MDPs which “almost” have the required special structure. This work
was originally presented in [McMahan and Gordon, 2005a,b].

MDPs for real-world problems often have intractably large state spaces. In the second
line of work presented in this chapter, we consider solving such large problems when only
a partial policy to get from a fixed start state to a goal is needed. In this situation, restricting
computation to states relevant to this task can make much larger problems tractable. We
introduce a new algorithm, Bounded Real-Time Dynamic Programming (BRTDP), which
can produce partial policies with strong performance guarantees while only touching a
fraction of the state space, even on problems where other algorithms would have to visit
the full state space. To do this, Bounded RTDP maintains both upper and lower bounds
on the optimal value function. The performance of Bounded RTDP is greatly aided by the
introduction of a new technique to efficiently find suitable upper (pessimistic) bounds on
the value function; this technique can also be used to provide informed initialization to a
wide range of other planning algorithms. This is an extended treatment of the research
originally described in [McMahan et al., 2005].


2.2  Markov  Decision  Processes 

We  briefly  review  the  MDP  formulation  and  introduce  the  notation  we  will  use  for  the 
work  presented  in  this  chapter.  The  problem  of  finding  an  optimal  policy  in  an  MDP  can 
be  formulated  with  respect  to  several  possible  objective  functions:  expected  total  reward, 
expected  discounted  reward,  and  average  reward  [Puterman,  1994].  In  this  chapter,  we 
restrict ourselves to the expected total reward criterion; we assume non-negative costs and
the  existence  of  an  absorbing  goal  state  to  ensure  a  finite  optimal  value  function.  This 
formulation  is  sometimes  called  the  stochastic  shortest  path  problem. 

We represent a stochastic shortest path problem with a fixed start state as a tuple
$\mathcal{M} = (S, A, P, c, s, g)$, where $S$ is a finite set of states, $s \in S$ is the start state,
$g \in S$ is the goal state, $A$ is a finite action set, $c : S \times A \to \mathbb{R}_+$ is a cost
function, and $P$ gives the dynamics; we write $P^a_{xy}$ for the probability of reaching state $y$
when executing action $a$ from state $x$. Since $g$ is an absorbing goal state we have
$c(g, a) = 0$ and $P^a_{gg} = 1$ for all actions $a$. The set
$\mathrm{succ}(x, a) = \{ y \in S \mid P^a_{xy} > 0,\ y \neq g \}$ contains all possible successors of
state $x$ under action $a$, except that the goal state is always excluded. Similarly, $\mathrm{pred}(x)$ is
the set of all state-action pairs $(y, b)$ such that taking action $b$ from state $y$ has a positive
chance of reaching state $x$.
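
One possible in-memory encoding of this tuple is sketched below. The three-state problem, the dictionary layout, and the names `P`, `c`, `GOAL`, and `build_pred` are illustrative assumptions rather than anything prescribed by the text; the goal's own zero-cost self-loop actions are left implicit.

```python
from collections import defaultdict

# Hypothetical problem: P[(x, a)] maps each successor y to the probability P^a_{xy}.
GOAL = "g"
P = {
    ("s", "go"):   {"m": 1.0},
    ("m", "fast"): {"g": 0.5, "s": 0.5},
    ("m", "slow"): {"g": 1.0},
}
# c[(x, a)] is the non-negative cost of taking action a in state x.
c = {("s", "go"): 1.0, ("m", "fast"): 1.0, ("m", "slow"): 3.0}


def succ(x, a):
    """Successors of x under a with positive probability, with the goal excluded."""
    return {y for y, p in P[(x, a)].items() if p > 0 and y != GOAL}


def build_pred():
    """pred[x] is the set of pairs (y, b) such that taking b in y can reach x."""
    pred = defaultdict(set)
    for (y, b), dist in P.items():
        for x, p in dist.items():
            if p > 0:
                pred[x].add((y, b))
    return pred


pred = build_pred()
```

Precomputing `pred` once is a natural choice for the expansion-style algorithms discussed later, which repeatedly iterate over the predecessors of a state.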

A stationary policy is a function $\pi : S \to A$. A policy is proper if an agent following
it from any state will eventually reach the goal with probability 1. We make the standard
assumption that at least one proper policy exists for $\mathcal{M}$, and that all improper policies have
infinite expected total cost at some state [Bertsekas and Tsitsiklis, 1996]. For a proper
policy $\pi$, we define the value function of $\pi$ as the solution to the set of linear equations:

$$v_\pi(x) = c(x, \pi(x)) + \sum_{y \in S} P^{\pi(x)}_{xy} v_\pi(y).$$

It is straightforward to verify that $v_\pi(x)$ is exactly the expected cost of reaching the goal
by following $\pi$.
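
As a small worked instance, consider a hypothetical problem (not one from the thesis) with non-goal states $s$ and $m$: under $\pi$, state $s$ moves to $m$ with probability 1 at cost 1, and state $m$ pays cost 1 and reaches the goal $g$ with probability $\tfrac12$ or returns to $s$ with probability $\tfrac12$. Taking the value of the goal to be zero, the linear equations and their solution are:

```latex
\begin{align*}
v_\pi(s) &= 1 + v_\pi(m), \\
v_\pi(m) &= 1 + \tfrac{1}{2}\, v_\pi(s) + \tfrac{1}{2}\, v_\pi(g)
          = 1 + \tfrac{1}{2}\, v_\pi(s), \\
\Rightarrow\quad v_\pi(m) &= 3, \qquad v_\pi(s) = 4 .
\end{align*}
```

Substituting back confirms both equations: $1 + 3 = 4$ and $1 + \tfrac12 \cdot 4 = 3$.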

If $v \in \mathbb{R}^{|S|}$ is an arbitrary assignment of values to states, we define Q values with
respect to $v$ by

$$Q_v(x, a) = c(x, a) + \sum_{y \in S} P^a_{xy} v(y).$$

It is well-known that there exists an optimal (minimal) value function $v^*$, and it satisfies
Bellman’s equations at all non-goal states $x$ and for all actions $a$:

$$v^*(x) = \min_{a \in A} Q^*(x, a)$$

$$Q^*(x, a) = c(x, a) + \sum_{y \in \mathrm{succ}(x, a)} P^a_{xy} v^*(y).$$

We can write these equations more compactly as $v^*(x) = \min_{a \in A} Q_{v^*}(x, a)$. For an
arbitrary $v$, we define the (signed) Bellman error of $v$ at $x$ by
$\mathrm{be}_v(x) = v(x) - \min_{a \in A} Q_v(x, a)$,
the difference between the left-hand side and right-hand side of the Bellman equations. A
greedy policy with respect to some value function $v$, $\mathrm{greedy}(v)$, is a policy that satisfies

$$\mathrm{greedy}(v, x) \in \arg\min_{a \in A} Q_v(x, a).$$

We say a policy $\pi$ is optimal if $v_\pi(x) = v^*(x)$ for all $x$. It follows that a greedy policy with
respect to $v^*$ is optimal. In this way, the problem of finding an optimal policy reduces to
that of finding the optimal value function.
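
The definitions of Q values, Bellman error, and greedy action can be sketched directly in code. The example problem and all identifiers below are hypothetical; in particular, the per-state `actions` table is a convenience assumed here, whereas the text uses a single action set for all states.

```python
# Hypothetical problem: P[(x, a)] maps successor -> probability; "g" is the goal.
P = {
    ("s", "go"):   {"m": 1.0},
    ("m", "fast"): {"g": 0.5, "s": 0.5},
    ("m", "slow"): {"g": 1.0},
}
c = {("s", "go"): 1.0, ("m", "fast"): 1.0, ("m", "slow"): 2.0}
actions = {"s": ["go"], "m": ["fast", "slow"]}  # assumed per-state action lists


def q_value(v, x, a):
    """Q_v(x, a) = c(x, a) + sum_y P^a_{xy} v(y); missing states count as value 0."""
    return c[(x, a)] + sum(p * v.get(y, 0.0) for y, p in P[(x, a)].items())


def bellman_error(v, x):
    """Signed Bellman error: v(x) minus the best one-step lookahead value."""
    return v[x] - min(q_value(v, x, a) for a in actions[x])


def greedy(v, x):
    """An action minimising Q_v(x, .); ties go to the first action listed."""
    return min(actions[x], key=lambda a: q_value(v, x, a))


v_zero = {"s": 0.0, "m": 0.0, "g": 0.0}   # an arbitrary starting estimate
v_star = {"s": 3.0, "m": 2.0, "g": 0.0}   # optimal values for this toy problem
```

For this toy problem `v_star` has zero Bellman error at both non-goal states, while the all-zero estimate has error $-1$ at each of them and a different greedy action at `m`.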

To simplify notation, we have omitted the possibility of discounting. A discount factor
$\gamma$ can be introduced indirectly, however, by reducing $P^a_{xy}$ by a factor of $\gamma$ for all $y \neq g$ and
increasing $P^a_{xg}$ accordingly, where $g$ is the absorbing goal state.
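
A minimal sketch of this transformation follows, assuming the dynamics are stored as a dictionary from state-action pairs to successor distributions. The function name `add_discount` and the data layout are hypothetical choices made for illustration.

```python
def add_discount(P, gamma, goal):
    """Fold a discount factor into the dynamics of a stochastic shortest path problem.

    P[(x, a)] maps successor -> probability. Every transition to a non-goal
    state is scaled by gamma and the probability mass removed in this way is
    redirected to the absorbing goal. Transitions out of the goal are unchanged.
    """
    discounted = {}
    for (x, a), dist in P.items():
        if x == goal:
            discounted[(x, a)] = dict(dist)
            continue
        new_dist = {y: gamma * p for y, p in dist.items() if y != goal}
        new_dist[goal] = 1.0 - sum(new_dist.values())
        discounted[(x, a)] = new_dist
    return discounted


# Hypothetical usage on a two-action fragment.
P = {("x", "a"): {"x": 0.5, "y": 0.5}, ("x", "b"): {"y": 0.8, "g": 0.2}}
P_disc = add_discount(P, gamma=0.9, goal="g")
```

Each transformed distribution still sums to one, so any solver for the undiscounted formulation can be applied to the result unchanged.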

For  more  details  on  the  MDP  formulation,  consult  Puterman  [1994]  or  Bertsekas 
[1995].  We  now  turn  to  algorithms  for  the  stochastic  shortest  path  problem.  Note  that 
we  do  not  assume  the  existence  of  a  fixed  start  state  s  in  Section  2.3,  but  this  assumption 
will  be  central  to  Bounded  Real-Time  Dynamic  Programming,  introduced  in  Section  2.4. 
2.3  Approaches based on Prioritization and Policy Evaluation

Many algorithms for planning in Markov decision processes work by maintaining
estimates $v$ and $Q$ of $v^*$ and $Q^*$, and repeatedly updating the estimates to reduce the difference
between the two sides of the Bellman equations (called the Bellman error). For example,
value iteration (VI) repeatedly loops through all states $x$ performing backup operations at
each one:

    for all actions a do
        Q(x, a) ← c(x, a) + Σ_{y ∈ succ(x,a)} P^a_{xy} v(y)
    end for
    v(x) ← min_{a ∈ A} Q(x, a)

On the other hand, Dijkstra’s algorithm carefully schedules^ expansion operations at
each state $x$ instead:

    v(x) ← min_{a ∈ A} Q(x, a)
    for all (y, b) ∈ pred(x) do
        Q(y, b) ← c(y, b) + Σ_{x' ∈ succ(y,b)} P^b_{yx'} v(x')
    end for

For good recent references on value iteration and Dijkstra’s algorithm, see [Bertsekas,
1995] and [Cormen et al., 1990].
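
The backup operation and the value iteration loop built from it can be sketched as follows. The example problem, the stopping rule based on the largest change in a sweep, and all identifiers are assumptions made for illustration; the goal is left out of the value table and treated as having value zero, which is equivalent to excluding it from the sum.

```python
def backup(v, Q, x, P, c, actions):
    """One backup at state x: refresh Q(x, a) for every action, then v(x)."""
    for a in actions[x]:
        Q[(x, a)] = c[(x, a)] + sum(p * v.get(y, 0.0) for y, p in P[(x, a)].items())
    v[x] = min(Q[(x, a)] for a in actions[x])


def value_iteration(states, P, c, actions, tol=1e-9, max_sweeps=10_000):
    """Sweep over all non-goal states until no value changes by more than tol."""
    v = {x: 0.0 for x in states}
    Q = {}
    for _ in range(max_sweeps):
        largest_change = 0.0
        for x in states:
            old = v[x]
            backup(v, Q, x, P, c, actions)
            largest_change = max(largest_change, abs(v[x] - old))
        if largest_change < tol:
            break
    return v, Q


# Hypothetical problem; "g" is the goal and does not appear in `states`.
P = {
    ("s", "go"):   {"m": 1.0},
    ("m", "fast"): {"g": 0.5, "s": 0.5},
    ("m", "slow"): {"g": 1.0},
}
c = {("s", "go"): 1.0, ("m", "fast"): 1.0, ("m", "slow"): 2.0}
actions = {"s": ["go"], "m": ["fast", "slow"]}
v, Q = value_iteration(["s", "m"], P, c, actions)
```

The fixed sweep order used here is the simplest possible schedule; the prioritized methods discussed in this section differ precisely in how they choose which state to update next.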

Any sequence of backups or expansions is guaranteed to make $v$ and $Q$ converge to
the optimal $v^*$ and $Q^*$ so long as we visit each state infinitely often. Of course, some
sequences will converge much more quickly than others. A wide variety of algorithms
have attempted to find good state-visitation orders to ensure fast convergence. For
example, Dijkstra’s algorithm is guaranteed to find an optimal ordering for a deterministic
positive-cost MDP; for general MDPs, algorithms like prioritized sweeping [Moore and
Atkeson, 1993], generalized prioritized sweeping [Andre et al., 1998], RTDP [Barto et al.,
1995], LRTDP [Bonet and Geffner, 2003b], Focussed Dynamic Programming [Ferguson
and Stentz, 2004], and HDP [Bonet and Geffner, 2003a] all attempt to compute good
orderings.

^We defer a full discussion of Dijkstra’s algorithm to Section 2.3.1.

Figure 2.1: A Markov chain for which backup-based methods converge slowly. Each
action costs 1.

Algorithms based on backups or expansions have an important disadvantage, though:
they can be slow at policy evaluation in MDPs with even a few stochastic transitions. For
example, in the Markov chain of Figure 2.1 (which has only one stochastic transition), the
best possible ordering for value iteration will only reduce Bellman error by 1% with each
five backups. To find the optimal value function quickly for this chain (or for an MDP
which contains it), we turn instead to methods which solve systems of linear equations.
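
The effect can be reproduced on an even smaller chain. The sketch below is not the chain of Figure 2.1 (which is not reproduced here); it assumes a hypothetical single non-goal state that costs 1 per step, stays put with probability 0.99, and reaches the goal otherwise. Repeated backups shrink the error by only 1% each, whereas solving the single linear equation gives the answer at once.

```python
# Hypothetical one-state chain (an assumption, not Figure 2.1).
p_stay = 0.99   # probability of remaining in the non-goal state
cost = 1.0      # cost of each step

# Policy evaluation by solving the linear equation v = cost + p_stay * v.
exact = cost / (1.0 - p_stay)

# Policy evaluation by repeated backups, starting from v = 0.
v = 0.0
backups = 0
while abs(exact - v) > 1e-3 * exact:   # stop at 0.1% relative error
    v = cost + p_stay * v
    backups += 1
```

Here the direct solve costs one division, while the backup loop needs several hundred iterations to reach the same accuracy.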

The policy iteration algorithm alternates between steps of policy evaluation and policy
improvement. If we fix an arbitrary policy and temporarily ignore all off-policy actions, the
Bellman equations become linear. We can solve this set of linear equations to evaluate our
policy, and set $v$ to be the resulting value function. Given $v$, we compute a greedy policy
$\pi = \mathrm{greedy}(v)$. Fixing this greedy policy gives another set of linear equations, which
can be solved to compute an improved policy. Policy iteration is guaranteed to converge
so long as the initial policy has a finite value function. Within the policy evaluation step
of policy iteration methods, we can choose any of several ways to solve our set of linear
equations [Press et al., 1992]. For example, we can use Gaussian elimination, sparse
Gaussian elimination, or biconjugate gradients with any of a variety of preconditioners.
We can even use value iteration, although as mentioned above value iteration may be a
slow way to solve the Bellman equations when we are evaluating a fixed policy.
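
The alternation between evaluation and improvement can be sketched as below, using dense Gaussian elimination for the evaluation step. This is one illustrative choice among the solvers listed above; the example problem, the tie-breaking rule that keeps the current action, and every identifier are assumptions rather than details taken from the text. The initial policy is assumed to be proper.

```python
def solve(A, b):
    """Solve A v = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    v = [0.0] * n
    for i in reversed(range(n)):
        v[i] = (M[i][n] - sum(M[i][k] * v[k] for k in range(i + 1, n))) / M[i][i]
    return v


def evaluate(states, P, c, pi):
    """Policy evaluation: solve (I - P_pi) v = c_pi over the non-goal states."""
    idx = {x: i for i, x in enumerate(states)}
    n = len(states)
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    b = [c[(x, pi[x])] for x in states]
    for x in states:
        for y, p in P[(x, pi[x])].items():
            if y in idx:               # the goal is absent, so its value is 0
                A[idx[x]][idx[y]] -= p
    return dict(zip(states, solve(A, b)))


def policy_iteration(states, P, c, actions, pi):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    while True:
        v = evaluate(states, P, c, pi)

        def q(x, a):
            return c[(x, a)] + sum(p * v.get(y, 0.0) for y, p in P[(x, a)].items())

        new_pi = {}
        for x in states:
            best = min(actions[x], key=lambda a: q(x, a))
            # Keep the current action on ties so that the loop terminates.
            new_pi[x] = pi[x] if q(x, pi[x]) <= q(x, best) + 1e-12 else best
        if new_pi == pi:
            return pi, v
        pi = new_pi


# Hypothetical problem; "g" is the goal and does not appear in `states`.
P = {
    ("s", "go"):   {"m": 1.0},
    ("m", "fast"): {"g": 0.5, "s": 0.5},
    ("m", "slow"): {"g": 1.0},
}
c = {("s", "go"): 1.0, ("m", "fast"): 1.0, ("m", "slow"): 2.0}
actions = {"s": ["go"], "m": ["fast", "slow"]}
pi, v = policy_iteration(["s", "m"], P, c, actions, {"s": "go", "m": "fast"})
```

On this toy problem the first evaluation assigns value 3 to `m` under `fast`, the improvement step switches `m` to `slow`, and the second evaluation confirms that no further change helps.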

Of  the  algorithms  discussed  above,  no  single  one  is  fast  at  solving  all  types  of  Markov 
decision  process.  Backup-based  and  expansion-based  methods  work  well  when  the  MDP 
has  short  or  nearly  deterministic  paths  with  little  chance  of  cycling,  but  can  converge 
slowly  in  the  presence  of  noise  and  cycles.  On  the  other  hand,  policy  iteration  evaluates 
each  policy  quickly,  but  may  spend  work  evaluating  a  policy  even  after  it  has  become 
obvious  that  another  policy  is  better. 

This section describes three new algorithms which blend features of Dijkstra’s
algorithm, value iteration, and policy iteration. In Section 2.3.1, we describe Improved
Prioritized Sweeping (IPS). IPS reduces to Dijkstra’s algorithm when given a deterministic MDP,
but also works well on MDPs with stochastic outcomes. In Section 2.3.2, we develop
Prioritized Policy Iteration (PPI) by extending IPS with policy evaluation steps.
Section 2.3.3 describes Gauss-Dijkstra Elimination (GDE), which interleaves policy
evaluation and prioritized scheduling more tightly. GDE reduces to Dijkstra’s algorithm for
deterministic MDPs, and to Gaussian elimination for policy evaluation. In Section 2.3.5,
we experimentally demonstrate that these algorithms extend the advantages of Dijkstra’s
algorithm to “mostly” deterministic MDPs, and that the policy evaluation performed by
PPI and GDE speeds convergence on problems where backups alone would be slow.

    main():
        queue.clear()
        (∀x) closed(x) ← false
        (∀x) v(x) ← M
        (∀x, a) Q(x, a) ← M
        (∀a) Q(goal, a) ← 0
        closed(goal) ← true
        (∀x) π(x) ← undefined
        π(goal) ← arbitrary
        update(goal)
        while (not queue.isempty()) do
            x ← queue.pop()
            closed(x) ← true
            update(x)
        end while

    update(x):
        v(x) ← Q(x, π(x))
        for all (y, b) ∈ pred(x) do
            Q_old ← Q(y, π(y))   (or M if π(y) undefined)
            Q(y, b) ← c(y, b) + Σ_{x' ∈ succ(y,b)} P^b_{yx'} v(x')
            if ((not closed(y)) and Q(y, b) < Q_old) then
                pri ← Q(y, b)   (*)
                π(y) ← b
                queue.decreasepriority(y, pri)
            end if
        end for

Figure 2.2: Dijkstra’s algorithm, in a notation which will allow us to generalize it to
stochastic MDPs. The variable “queue” is a priority queue which returns the smallest
of its elements each time it is popped. The constant M is an upper bound on the value of
(distance to) any state.


2.3.1  Improved  Prioritized  Sweeping 

Dijkstra’s algorithm is shown in Figure 2.2. Its basic idea is to keep states on a priority
queue, sorted by how urgent it is to expand them. The priority queue is assumed to support
operations queue.pop(), which removes and returns the queue element with numerically
lowest priority; queue.decreasepriority(x, p), which puts x on the queue if it wasn’t there,
or if it was there with priority > p sets its priority to p, or if it was there with priority ≤ p
does nothing; and queue.clear(), which empties the queue.
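The three queue operations can be sketched in Python on top of the standard `heapq` module. This is only an illustrative sketch: the class name `DecreaseQueue`, the `isempty` helper, and the lazy-deletion strategy (leaving stale heap entries in place and skipping them on pop) are assumptions of the example, not something the thesis prescribes.

```python
import heapq
import itertools


class DecreaseQueue:
    """Min-priority queue offering pop, decreasepriority, and clear."""

    def __init__(self):
        self._heap = []                  # entries are (priority, tie_breaker, item)
        self._best = {}                  # item -> lowest priority currently queued
        self._count = itertools.count()  # tie breaker so items are never compared

    def clear(self):
        self._heap.clear()
        self._best.clear()

    def isempty(self):
        return not self._best

    def decreasepriority(self, item, priority):
        # Insert the item, or lower its priority; never raise a priority.
        if item in self._best and self._best[item] <= priority:
            return
        self._best[item] = priority
        heapq.heappush(self._heap, (priority, next(self._count), item))

    def pop(self):
        # Skip stale heap entries left behind by earlier decreases.
        while self._heap:
            priority, _, item = heapq.heappop(self._heap)
            if self._best.get(item) == priority:
                del self._best[item]
                return item
        raise IndexError("pop from an empty queue")
```

A binary heap with lazy deletion keeps each operation logarithmic in the number of heap entries, at the cost of storing superseded entries until they are popped.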

Figure 2.3: An MDP whose best state ordering is impossible to determine using only local
properties of the states. Arcs which split correspond to actions with stochastic outcomes;
for example, taking action b from state 1 reaches G with probability 0.5 and 2 with
probability 0.5.

In deterministic Markov decision processes with positive costs, it is always possible
to find a new state x to expand whose value we can set to v*(x) immediately. So, in
these MDPs, Dijkstra’s algorithm touches each state only once while computing v*, and is
therefore by far the fastest way to find a complete policy.
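The single-expansion behaviour on deterministic MDPs can be sketched as follows. The sketch follows the structure of Figure 2.2 under some simplifying assumptions: the input format (`transitions` mapping a state-action pair to one successor and a cost), the function name `dijkstra_mdp`, and the example MDP are all hypothetical, only the Q value of the currently selected action is stored, and stale heap entries stand in for `decreasepriority`.

```python
import heapq


def dijkstra_mdp(transitions, goal, big_m=10**9):
    """Solve a deterministic, positive-cost MDP with one expansion per state.

    transitions maps (state, action) -> (next_state, cost).
    """
    pred = {}
    states = {goal}
    for (y, b), (x, _cost) in transitions.items():
        pred.setdefault(x, []).append((y, b))
        states.update((y, x))

    v = {s: big_m for s in states}
    q_best = {goal: 0}      # Q(y, pi(y)) for states that have a selected action
    policy = {}
    closed = {goal}
    heap = []
    order = []              # states in the order they are expanded

    def update(x):
        v[x] = q_best[x]
        for y, b in pred.get(x, []):
            q_old = q_best.get(y, big_m)           # M if pi(y) is undefined
            q_new = transitions[(y, b)][1] + v[x]  # the only successor is x
            if y not in closed and q_new < q_old:
                q_best[y] = q_new
                policy[y] = b
                heapq.heappush(heap, (q_new, y))   # line (*): pri = Q(y, b)

    update(goal)
    while heap:
        _pri, x = heapq.heappop(heap)
        if x in closed:
            continue        # stale entry; stands in for decreasepriority
        closed.add(x)
        order.append(x)
        update(x)
    return v, policy, order


# Hypothetical three-state example with goal "G".
example = {
    ("s1", "a"): ("G", 10),
    ("s1", "b"): ("s2", 2),
    ("s2", "a"): ("s1", 1),
    ("s2", "b"): ("G", 3),
}
values, policy, order = dijkstra_mdp(example, "G")
```

In this example each non-goal state is expanded exactly once, and state s1 switches from action a to action b before it is ever popped.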

An optimal ordering for backups or expansions is an ordering of the states such that
for all states x, the value v*(x) can be determined using only v*(y) for states y which
come before x in the ordering. In MDPs with stochastic outcomes, there need not exist an
optimal ordering. Even if there exists such an ordering (i.e., if there is an acyclic optimal
policy), we might need to look at non-local properties of states to find it: Figure 2.3 shows
an MDP with four non-goal states (numbered 1-4) and two actions (a and b). In this MDP,
the optimal policy is acyclic with ordering G3214. But, after expanding the goal state,
there is no way to tell which of states 1 and 3 to expand next: both have one deterministic
action which reaches the goal, and one stochastic action that reaches the goal half the time
and an unexplored state half the time. If we expand either one we will set its policy to
action a and its value to 10; if we happen to choose state 3 we will be correct, but the
optimal action from state 1 is b and v*(1) = 13/2 < 10.

Several algorithms, most notably prioritized sweeping [Moore and Atkeson, 1993] and
generalized prioritized sweeping [Andre et al., 1998], have attempted to extend the priority
queue idea to MDPs with stochastic outcomes. These algorithms give up the property of
visiting each state only once in exchange for solving a larger class of MDPs. However,
neither of these algorithms reduces to Dijkstra’s algorithm if the input MDP happens to be
deterministic. Therefore, they potentially take far longer to solve a deterministic or
nearly-deterministic MDP than they need to. In the next section, we discuss what properties
an expansion-scheduling algorithm needs to have to reduce to Dijkstra’s algorithm on
deterministic MDPs.


Generalizing  Dijkstra 

We will consider algorithms which replace the line (*) in Figure 2.2 by other priority
calculations that maintain the property that when the input MDP is deterministic with
positive edge costs an optimal ordering is produced. If the input MDP is stochastic, a single
pass of a generalized Dijkstra algorithm generally will not compute v*, so we will have to
run multiple passes. Each subsequent pass can start from the value function computed by
the previous pass (instead of from v(x) = M like the first pass), so multiple passes will
cause v to converge to v*. (Likewise, we can save Q values from pass to pass.) We now
consider several priority calculations that satisfy the desired property.


Large Change in Value The simplest statistic which allows us to identify
completely-determined states, and the one most similar in spirit to prioritized sweeping, is
how much the state’s value will change when we expand it. In line (*) of Figure 2.2, suppose
that we set

\[ \mathrm{pri} \leftarrow d(v(y) - Q(y, b)) \tag{2.1} \]

for some monotone decreasing function d : ℝ → ℝ. Any state y with closed(y) = false
(called an open state) will have v(y) = M in the first pass, while closed states will have
lower values of v(y). So, any deterministic action leading to a closed state will have lower
Q(y, b) than any action which might lead to an open state. And, any open state y which has
a deterministic action b leading to a closed state will be on our queue with priority at most
d(v(y) − Q(y, b)) = d(M − Q(y, b)). So, if our MDP contains only deterministic actions,
the state at the head of the queue will be the open state with the smallest Q(y, b), exactly
as in Dijkstra’s algorithm.

Note that prioritized sweeping and generalized prioritized sweeping perform backups
rather than expansions, and use a different estimate of how much a state’s value will
change when updated. Namely, they keep track of how much a state’s successors’ values
have changed and base their priorities on these changes weighted by the corresponding
transition probabilities. This approach, while in the spirit of Dijkstra’s algorithm, does
not reduce to Dijkstra’s algorithm when applied to deterministic MDPs. Wiering [1999]
discusses the priority function (2.1), but he does not prescribe the uniform pessimistic
initialization of the value function which is given in Figure 2.2. This pessimistic initialization
is necessary to make (2.1) reduce to Dijkstra’s algorithm. Other authors (for example
Dietterich and Flann [1995]) have discussed pessimistic initialization for prioritized sweeping,
but only in the context of the original non-Dijkstra priority scheme for that algorithm.

One problem with the priority scheme of equation (2.1) is that it only reduces to
Dijkstra’s algorithm if we uniformly initialize v(x) ← M for all x. If instead we pass in
some nonuniform v(x) ≥ v*(x) (such as one which we computed in a previous pass of our
algorithm, or one we got by evaluating a policy provided by a domain expert), we may not
expand states in the correct order in a deterministic MDP.² This property is somewhat
unfortunate: by providing stronger initial bounds, we may cause our algorithm to run longer.
So, in the next few subsections we will investigate additional priority schemes which can
help alleviate this problem.
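A small numerical illustration of this ordering problem follows. The two open states, their initial values, and the helper names are hypothetical; d is taken to be negation and m the identity, which are just one admissible choice of monotone functions for priority schemes (2.1) and the upper-bound scheme discussed next.

```python
def value_change_priority(v_y, q_yb):
    """Scheme (2.1) with d(z) = -z, a monotone decreasing function."""
    return -(v_y - q_yb)


def upper_bound_priority(v_y, q_yb):
    """Upper-bound scheme with m(z) = z; the initial value v(y) is ignored."""
    return q_yb


def head_of_queue(candidates, priority):
    """Return the open state that a min-queue would pop first."""
    return min(candidates, key=lambda s: priority(*candidates[s]))


# Open states in a deterministic MDP: name -> (initial v(y), best Q(y, b) so far).
uniform = {"A": (1000.0, 4.0), "B": (1000.0, 3.0)}   # v(x) = M everywhere
nonuniform = {"A": (10.0, 4.0), "B": (5.0, 3.0)}     # tighter, unequal bounds

# Dijkstra's algorithm would expand B first in both cases (smallest Q).
first_uniform = head_of_queue(uniform, value_change_priority)
first_nonuniform = head_of_queue(nonuniform, value_change_priority)
first_bound = head_of_queue(nonuniform, upper_bound_priority)
```

With the uniform initialization the value-change priority picks B, matching Dijkstra; with the nonuniform bounds it picks A, because A's value drops by 6 while B's drops by only 2.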


Low Upper Bound on Value Another statistic which allows us to identify
completely-determined states x in Dijkstra’s algorithm is an upper bound on v*(x). If, in
line (*) of Figure 2.2, we set

\[ \mathrm{pri} \leftarrow m(Q(y, b)) \tag{2.2} \]

for some monotone increasing function m(·), then any open state y which has a deterministic
action b leading to a closed state will be on our queue with priority at most m(Q(y, b)).
(Note that Q(y, b) is an upper bound on v*(y) because we have initialized v(x) ← M for
all x.) As before, in a deterministic MDP, the head of the queue will be the open state with
smallest Q(y, b). But, unlike before, this fact holds no matter how we initialize v (so long
as v(x) ≥ v*(x)): in a deterministic positive-cost MDP, it is always safe to expand the
open state with the lowest upper bound on its value.


High  Probability  of  Reaching  Goal  Dijkstra’s  algorithm  can  also  be  viewed  as  building 
a  set  of  closed  states,  whose  v*  values  are  completely  known,  by  starting  from  the  goal  state 
and  expanding  outward.  According  to  this  intuition,  we  should  consider  maintaining  an 
estimate  of  how  well-known  the  values  of  our  states  are,  and  adding  the  best-known  states 
to  our  closed  set  first. 

²We need to be careful passing in arbitrary v(x) vectors for initialization; if there are any optimal but
underconsistent states (states whose v(x) is already equal to v*(x), but whose v(x) is less than the
right-hand side of the Bellman equation), then the check Q(y, b) < v(y) will prevent us from pushing them on
the queue even though their predecessors may be inconsistent. So, such an initialization for v may cause
our algorithm to terminate prematurely before v = v* everywhere. Fortunately, if we initialize using a v
computed from a previous pass of our algorithm, or set v to the value of some policy, then there will be no
optimal but underconsistent states, so this problem will not arise.
For this purpose, we can add extra variables p_goal(x, a) for all states x and actions a,
initialized to 0 if x is a non-goal state and 1 if x is a goal state. Let us also add variables
p_goal(x) for all states x, again initialized to 0 if x is a non-goal state and 1 if x is a goal
state.

To maintain the p_goal variables, each time we update Q(y, b) we can set

\[ p_{\mathrm{goal}}(y, b) \leftarrow \sum_{x' \in \mathrm{succ}(y, b)} P^{b}_{yx'} \, p_{\mathrm{goal}}(x') \]

And, when we assign v(x) ← Q(x, a) we can set

\[ p_{\mathrm{goal}}(x) \leftarrow p_{\mathrm{goal}}(x, a) \]

(in this case, we will call a the selected action from x). With these definitions, p_goal(x) will
always remain equal to the probability of reaching the goal from x by following selected
actions and at each step moving from a state expanded later to one expanded earlier (we
call such a path a decreasing path). In other words, p_goal(x) tells us what fraction of our
current estimate v(x) is based on fully-examined paths which reach the goal.
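As a small illustration of this bookkeeping, the sketch below backs up p_goal for the two actions available in state 1 of the example discussed earlier: a deterministic action a that reaches the goal, and action b, which reaches G and state 2 with probability 0.5 each. The dictionary layout and the function name `backup_p_goal` are assumptions made for the example.

```python
def backup_p_goal(outcomes, p_goal):
    """p_goal(y, b): sum over successors x' of P(x' | y, b) * p_goal(x')."""
    return sum(prob * p_goal[succ] for succ, prob in outcomes.items())


# Right after the goal has been expanded: only G is known to reach the goal.
p_goal = {"G": 1.0, "1": 0.0, "2": 0.0}

p_goal_1a = backup_p_goal({"G": 1.0}, p_goal)            # deterministic action a
p_goal_1b = backup_p_goal({"G": 0.5, "2": 0.5}, p_goal)  # stochastic action b

# When v(1) <- Q(1, a) is assigned, a becomes the selected action for state 1.
p_goal["1"] = p_goal_1a
```

Action b gets p_goal = 0.5 because half of its probability mass leads to the still-unexpanded state 2, whose own p_goal is 0.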

In a deterministic MDP, p_goal will always be either 0 or 1: it will be 0 for open states,
and 1 for closed states. Since Dijkstra’s algorithm never expands a closed state, we can
combine any decreasing function of p_goal(x) with any of the above priority functions
without losing our equivalence to Dijkstra. For example, we could use

\[ \mathrm{pri} \leftarrow m(Q(y, b),\ 1 - p_{\mathrm{goal}}(y)) \tag{2.3} \]

where m is a two-argument monotone function.³

In the first sweep after we initialize v(x) ← M, priority scheme (2.3) is essentially
equivalent to schemes (2.1) and (2.2): the value Q(x, a) can be split up as

\[ p_{\mathrm{goal}}(x, a) \, Q_D(x, a) + (1 - p_{\mathrm{goal}}(x, a)) \, M \]

where Q_D(x, a) is the expected cost to reach the goal assuming that we follow a decreasing
path. That means that a fraction 1 − p_goal(x, a) of the value Q(x, a) will be determined
by the large constant M, so state-action pairs with higher p_goal(x, a) values will almost
always have lower Q(x, a) values. However, if we have initialized v(x) in some other
way, then equation (2.1) no longer reduces to Dijkstra’s algorithm, while equations (2.2)
and (2.3) are different but both reduce to Dijkstra’s algorithm on deterministic MDPs.

³A monotone function with multiple arguments is one which always increases when we increase one of
the arguments while holding the others fixed.

This general technique can be thought of as tracking the probability of reaching the
goal (versus reaching a history where no action is specified) under a particular
nonstationary partial policy. In addition to providing a method for scheduling in our
generalizations of Dijkstra’s algorithm, we will use a similar approach to help schedule
row-elimination operations in an application of Gaussian elimination to solving MDPs
(Section 2.3.3), as well as to produce high-quality upper bounds on the optimal value
function in order to initialize the Bounded RTDP algorithm (Section 2.4.2).


All of the Above Instead of restricting ourselves to just one of the priority functions
mentioned above, we can combine all of them: since the best states to expand in a
deterministic MDP will win on any one of the above criteria, we can use any monotone function
of all of the criteria and still behave like Dijkstra in deterministic MDPs. For example, we
can take the sum of two of the priority functions, or the product of two positive priority
functions; or, we can use one of the priorities as the primary sort key and break ties
according to a different one.

We have experimented with several different combinations of priority functions; the
experimental results we report use the priority functions

\[ \mathrm{pri}_1(x, a) = \frac{Q(x, a) - v(x)}{Q(x, a) + 1} \tag{2.4} \]

and

\[ \mathrm{pri}_2(x, a) = (1 - p_{\mathrm{goal}}(x),\ \mathrm{pri}_1(x, a)) \tag{2.5} \]

The pri_1 function combines the value change criterion (2.1) with the upper bound
criterion (2.2). It is always negative or zero, since 0 < Q(x, a) ≤ v(x). It decreases when
the value change increases (since 1/Q(x, a) is positive), and it increases as the upper
bound increases (since 1/x is a monotone decreasing function when x > 0, and since
Q(x, a) − v(x) ≤ 0).

The pri_2 function uses p_goal as a primary sort key and breaks ties according to pri_1.
That is, pri_2 returns a vector in ℝ² which should be compared according to lexical ordering
(e.g., (3, 3) < (4, 2) < (4, 3)).
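Both priority functions are easy to express in Python, where tuples already compare lexically. The function signatures below (taking Q(x, a), v(x), and p_goal(x) as plain numbers) are an assumption of this sketch.

```python
def pri1(q, v):
    """Equation (2.4): (Q(x, a) - v(x)) / (Q(x, a) + 1)."""
    return (q - v) / (q + 1.0)


def pri2(q, v, p_goal):
    """Equation (2.5): the pair (1 - p_goal(x), pri1(x, a)), compared lexically."""
    return (1.0 - p_goal, pri1(q, v))


# Python compares tuples lexically, matching the ordering in the text.
lexical_example = (3, 3) < (4, 2) < (4, 3)

# A fully known state (p_goal = 1) outranks a half-known one even though the
# half-known state has the larger value change and the lower upper bound.
known_first = pri2(q=9.0, v=10.0, p_goal=1.0) < pri2(q=3.0, v=10.0, p_goal=0.5)
```

Because a min-queue pops the smallest key, states with higher p_goal (smaller first component) are expanded first, and pri_1 only decides among states with equal p_goal.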


Sweeps  vs.  Multiple  Updates 

The algorithms we have described so far in this section must update every state once
before updating any state twice. We can also consider a version of the algorithm which
does not enforce this restriction; this multiple-update algorithm simply skips the check “if
not closed(y)” which ensures that we don’t push a previously-closed state onto the priority
queue. The multiple-update algorithm still reduces to Dijkstra’s algorithm when applied to
a deterministic MDP: any state which is already closed will fail the check Q(y, b) < v(y)
for all subsequent attempts to place it on the priority queue.

Experimentally, the multiple-update algorithm is faster than the algorithm which must
sweep through every state once before revisiting any state. Intuitively, the sweeping
algorithm can waste a lot of work at states far from the goal before it determines the optimal
values of states near the goal.

In the multiple-update algorithm we are always effectively in our “first sweep,” and so
since we initialize uniformly to a large constant M we can reduce to Dijkstra’s algorithm
by using priority pri_1 from equation (2.4). The resulting algorithm is called Improved
Prioritized Sweeping; its update method is listed in Figure 2.4.

As is typical for value-function based methods, we declare convergence when the
maximum Bellman error (over all states) drops below some preset limit ε. This is implemented
in IPS by an extra check that ensures all states on the priority queue have Bellman error
at least ε; when the queue is empty it is easy to show that no such states remain. Similar
methods are used for our other algorithms.
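To make the pieces concrete, here is a compact Python sketch of the multiple-update algorithm with the pri_1 priority and the ε check. It is an illustration under stated assumptions rather than the thesis implementation: the input format (`mdp` mapping a state-action pair to a cost and an outcome distribution), the function name, the lazy-deletion heap standing in for `decreasepriority`, and the example MDP are all hypothetical.

```python
import heapq
import itertools


def improved_prioritized_sweeping(mdp, goal, big_m=1e6, eps=1e-9):
    """mdp maps (state, action) -> (cost, {successor: probability})."""
    states, pred = {goal}, {}
    for (y, b), (_cost, outcomes) in mdp.items():
        states.add(y)
        for x in outcomes:
            states.add(x)
            pred.setdefault(x, []).append((y, b))

    v = {s: big_m for s in states}
    q = {sa: big_m for sa in mdp}
    policy = {}
    heap, queued, counter = [], {}, itertools.count()

    def q_selected(s):
        # Q(s, pi(s)); 0 at the goal, M while pi(s) is undefined.
        if s == goal:
            return 0.0
        return q[(s, policy[s])] if s in policy else big_m

    def decreasepriority(s, pri):
        if s not in queued or pri < queued[s]:
            queued[s] = pri
            heapq.heappush(heap, (pri, next(counter), s))

    def update(x):
        v[x] = q_selected(x)
        for y, b in pred.get(x, []):
            q_old = q_selected(y)
            cost, outcomes = mdp[(y, b)]
            q[(y, b)] = cost + sum(p * q_selected(s) for s, p in outcomes.items())
            if q[(y, b)] < q_old:
                policy[y] = b
                if abs(v[y] - q[(y, b)]) > eps:          # Bellman error check
                    pri = (q[(y, b)] - v[y]) / (q[(y, b)] + 1.0)
                    decreasepriority(y, pri)

    update(goal)
    while heap:
        pri, _, x = heapq.heappop(heap)
        if queued.get(x) != pri:
            continue                                     # stale heap entry
        del queued[x]
        update(x)
    return v, policy


# Hypothetical MDP with a two-state cycle; the optimal values are both 2.
example = {
    ("s1", "a"): (10.0, {"G": 1.0}),
    ("s1", "b"): (1.0, {"G": 0.5, "s2": 0.5}),
    ("s2", "a"): (1.0, {"G": 0.5, "s1": 0.5}),
    ("s2", "c"): (20.0, {"G": 1.0}),
}
values, policy = improved_prioritized_sweeping(example, "G")
```

On this example the two cyclic states are popped repeatedly, with the change in value shrinking geometrically until it falls below ε; the deterministic fallback actions a (from s1) and c (from s2) are never selected in the final policy.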

2.3.2  Prioritized  Policy  Iteration 

The Improved Prioritized Sweeping algorithm works well on MDPs which are moderately
close to being deterministic. Once we start to see large groups of states with strongly
interdependent values, there will be no expansion order which will allow us to find a good
approximation to v* in a small number of visits to each state. The MDP of Figure 2.1 is
an example of this problem: because there is a cycle which has high probability and visits
a significant fraction of the states, the values of the states along the cycle depend strongly
on each other.

To avoid having to expand states repeatedly to incorporate the effect of cycles, we will
turn to algorithms that occasionally do some work to evaluate the current policy. When
they do so, they will temporarily fix the current actions to make the value determination
problem linear. The simplest such algorithm is policy iteration, which alternates between
complete policy evaluation (which solves an S × S system of linear equations in an
S-state MDP) and greedy policy improvement (which picks the action which achieves the
minimum on the right-hand side of Bellman’s equation at each state).

update(x):
    v(x) ← Q(x, π(x))
    for all (y, b) ∈ pred(x) do
        Q_old ← Q(y, π(y))    (or M if π(y) undefined)
        Q(y, b) ← c(y, b) + Σ_{x' ∈ succ(y,b)} P^b_{yx'} Q(x', π(x'))
        if (Q(y, b) < Q_old) then
            pri ← (Q(y, b) − v(y)) / (Q(y, b) + 1)
            π(y) ← b
            if (|v(y) − Q(y, b)| > ε) then
                queue.decreasepriority(y, pri)
            end if
        end if
    end for

Figure 2.4: The update function for the Improved Prioritized Sweeping algorithm. The
main function is the same as for Dijkstra’s algorithm. As before, “queue” is a priority
min-queue and M is a very large positive number.

We will describe two algorithms which build on policy iteration. The first algorithm,
called Prioritized Policy Iteration, is the subject of the current section. PPI attempts to
improve on policy iteration’s greedy policy improvement step, doing a small amount of
extra work during this step to try to reduce the number of policy evaluation steps. Since
policy evaluation is usually much more expensive than policy improvement, any reduction
in the number of evaluation steps will usually result in a better total planning time. The
second algorithm, which we will describe in Section 2.3.3, tries to interleave policy
evaluation and policy improvement on a finer scale to provide more accurate Q and p_goal
estimates for picking actions and calculating priorities on the fringe.

Pseudo-code for PPI is given in Figure 2.5. The main loop is identical to regular policy
iteration, except for a call to sweep() rather than to a greedy policy improvement
routine. The policy evaluation step can be implemented efficiently by a call to a
sophisticated linear solver; such a solver can take advantage of sparsity in the transition dynamics
by constructing an explicit LU factorization [Duff et al., 1986], or it can take advantage
of good conditioning by using an iterative method such as stabilized biconjugate
gradients [Barrett et al., 1994]. In either case, we can expect to be able to evaluate policies
efficiently even in large Markov decision processes.

main():
    (∀x) v(x) ← M, v_old(x) ← M
    v(goal) ← 0, v_old(goal) ← 0
    while (true) do
        (∀x) π(x) ← undefined
        Δ ← 0
        sweep()
        if (Δ < tolerance) then
            declare convergence
        end if
        (∀x) v_old(x) ← v(x)
        v ← evaluate policy π
    end while

sweep():
    (∀x) closed(x) ← false
    (∀x) p_goal(x) ← 0
    closed(goal) ← true
    update(goal)
    while (not queue.isempty()) do
        x ← queue.pop()
        closed(x) ← true
        update(x)
    end while

update(x):
    for all (y, a) ∈ pred(x) do
        if (closed(y)) then
            Q(y, a) ← c(y, a) + Σ_{x' ∈ succ(y,a)} P^a_{yx'} v(x')
            Δ ← max(Δ, v(y) − Q(y, a))
        else
            for all actions b do
                Q_old ← Q(y, π(y))    (or M if π(y) undefined)
                Q(y, b) ← c(y, b) + Σ_{x' ∈ succ(y,b)} P^b_{yx'} v(x')
                p_goal(y, b) ← Σ_{x' ∈ succ(y,b)} P^b_{yx'} p_goal(x')
                if (Q(y, b) < Q_old) then
                    v(y) ← Q(y, b)
                    π(y) ← b
                    p_goal(y) ← p_goal(y, b)
                    pri ← (1 − p_goal(y), (v(y) − v_old(y)) / v(y))
                    queue.decreasepriority(y, pri)
                end if
            end for
        end if
    end for

Figure 2.5: The Prioritized Policy Iteration algorithm. As before, “queue” is a priority
min-queue and M is a very large positive number.

The policy improvement step is where we hope to beat policy iteration. By performing
a prioritized sweep through state space, so that we examine states near the goal before
states farther away, we can base many of our policy decisions on multiple steps of
lookahead. Scheduling the expansions in our sweep according to one of the priority functions
previously discussed ensures that PPI reduces to Dijkstra’s algorithm: when we run it on a
deterministic MDP, the first sweep will compute an optimal policy and value function,
and will never encounter a Bellman error in a closed state. So Δ will be 0 at the end of
the sweep, and we will pass the convergence test before evaluating a single policy. On
the other hand, if there are no action choices then PPI will not be much more expensive
than solving a single set of linear equations: the only additional expense will be the cost
of the sweep. If B is a bound on the number of outcomes of any action, then this cost is
O((BA)² S log S), typically much less expensive than solving the linear equations
(assuming B, A ≪ S). For PPI, we chose to use the pri_2 schedule from equation (2.5). Unlike
pri_1 (equation (2.4)), pri_2 forces us to expand states with high p_goal first, even when we
have initialized v to the value of a near-optimal policy.

In order to guarantee convergence, we need to set π(x) to a greedy action with respect
to v before each policy evaluation. Thus in the update(x) method of PPI, for each state
y for which there exists some action that reaches x, we re-calculate Q(y, b) values for all
actions b. In IPS, we only calculated Q(y, b) for actions b that reach x. The extra work
is necessary in PPI because the stored Q values may be unrelated to the current v (which
was updated by policy evaluation), and so otherwise π(x) might not be set to a greedy
action. Other Q-value update schemes are possible,⁴ and will lead to convergence as long
as they fix a greedy policy. Note also that extra work is done if the loops in update are
structured as in Figure 2.5; with a slight reduction in clarity, they can be arranged so that
each predecessor state y is backed up only once.

One  important  additional  tweak  to  PPI  is  to  perform  multiple  sweeps  between  policy 
evaluation  steps.  Since  policy  evaluation  tends  to  be  more  expensive,  this  allows  a  better 
tradeoff  to  be  made  between  evaluation  and  improvement  via  expansions. 

A  potentially  important  optimization  is  to  restrict  the  policy  evaluations  to  a  subset  of 
the states. By fixing a reduced set of states (called an envelope [Dean et al., 1995]) which
contains  mostly  “important”  states,  we  can  hope  to  gain  most  of  the  benefits  of  policy 
evaluation  at  a  fraction  of  the  cost.  There  are  many  ways  to  pick  an  envelope;  for  example, 
the  LAO*  algorithm  [Hansen  and  Zilberstein,  2001]  is  one  popular  approach. 

2.3.3  Gauss-Dijkstra  Elimination 

The  Gauss-Dijkstra  Elimination  algorithm  continues  the  theme  of  taking  advantage  of  both 
Dijkstra’s  algorithm  and  efficient  policy  evaluation,  but  it  interleaves  them  at  a  deeper 
level. 


⁴For example, we experimented with only updating Q(y, b) when P^b_{yx} > 0 in update and then doing
a single full backup of each state after popping it from the queue, ensuring a greedy policy. This approach
was on average slower than the one presented above.

Gaussian Elimination and MDPs Fixing a policy π for an MDP produces a Markov
chain and a vector of costs c. If our MDP has S states (not including the goal state), let P^π
be the S × S matrix with entries (P^π)_{xy} = P^{π(x)}_{xy} for all x, y ≠ goal. Finding the values of
the MDP under the given policy reduces to solving the linear equations

\[ (I - P^{\pi}) \, v = c \]


To solve these equations, we can run Gaussian elimination and backsubstitution on the
matrix (I − P^π). Gaussian elimination calls rowEliminate(x) (defined in Figure 2.6,
where Θ is initialized to P^π and w to c) for all x from 1 to S in order,⁵ zeroing out the
subdiagonal elements of (I − Θ). Backsubstitution calls backsubstitute(x) for all x from
S down to 1 to compute (I − P^π)^{-1} c. In Figure 2.6, Θ_{x·} denotes the x’th row of Θ, and
Θ_{y·} denotes the y’th row. We show updates to p_goal(x) explicitly, but it is easy to implement
these updates as an extra dense column in Θ.

To  see  why  Gaussian  elimination  works  faster  than  Bellman  backups  in  MDPs  with 
cycles,  consider  again  the  Markov  chain  of  Figure  2.1.  While  value  iteration  reduces 
Bellman  error  by  only  1%  per  sweep  on  this  chain,  Gaussian  elimination  solves  it  exactly 
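in a single sweep. The starting (I − P^π) matrix and c vector are:

\[
\left[\begin{array}{rrrrr|r}
 1 &  0 &  0 &  0 & -0.99 & -0.01 \\
-1 &  1 &  0 &  0 &  0    &  0    \\
 0 & -1 &  1 &  0 &  0    &  0    \\
 0 &  0 & -1 &  1 &  0    &  0    \\
 0 &  0 &  0 & -1 &  1    &  0
\end{array}\right],
\qquad
\left[\begin{array}{r} 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{array}\right]
\]

(for clarity, we have shown −p_goal(x) as an additional column separated by a bar). The
first call to rowEliminate changes row 2 to:

\[
\left[\begin{array}{rrrrr|r} 0 & 1 & 0 & 0 & -0.99 & -0.01 \end{array}\right],
\qquad
\left[\, 2 \,\right]
\]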


We can interpret this modified row 2 as a macro-action: we start from state 2 and execute
our policy until we reach a state other than 1 or 2. (In this case, we will end up at the
goal with probability 0.01 and in state 5 with probability 0.99.) Each subsequent call to
rowEliminate zeros out one of the −1s below the diagonal and defines another macro-action
of the form “start in state i and execute until we reach a state other than 1 through
i.” After four calls we are left with

\[
\left[\begin{array}{rrrrr|r}
1 & 0 & 0 &  0 & -0.99 & -0.01 \\
0 & 1 & 0 &  0 & -0.99 & -0.01 \\
0 & 0 & 1 &  0 & -0.99 & -0.01 \\
0 & 0 & 0 &  1 & -0.99 & -0.01 \\
0 & 0 & 0 & -1 &  1    &  0
\end{array}\right],
\qquad
\left[\begin{array}{r} 1 \\ 2 \\ 3 \\ 4 \\ 1 \end{array}\right]
\]

⁵Using the Θ representation causes a few minor changes to the Gaussian elimination code, but it has the
advantage that (Θ, w) can always be interpreted as a Markov chain which has the same value function as
the original (P^π, c). Also, for simplicity we will not consider pivoting; if π is a proper policy then (I − Θ)
will always have a nonzero entry on the diagonal.
The last call to rowEliminate zeros out the last subdiagonal element (in line (1)), setting
row 5 to:

\[
\left[\begin{array}{rrrrr|r} 0 & 0 & 0 & 0 & 0.01 & -0.01 \end{array}\right],
\qquad \left[\, 5 \,\right] \tag{2.6}
\]

Then it divides the whole row by 0.01 (line (2)) to get:

\[
\left[\begin{array}{rrrrr|r} 0 & 0 & 0 & 0 & 1 & -1 \end{array}\right],
\qquad \left[\, 500 \,\right] \tag{2.7}
\]

The division accounts for the fact that we may visit state 5 multiple times before our
macro-action terminates: equation (2.6) describes a macro-action which has a 99% chance
of self-looping and ending up back in state 5, while equation (2.7) describes the
macro-action which keeps going after a self-loop (an average of 100 times) and only stops when
it reaches the goal.
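The factor of 100 can be checked with a geometric series. Treating the row in equation (2.6) as a macro-action of cost 5 that returns to state 5 with probability 0.99, the expected number of executions and the resulting total cost are

\[
\sum_{k=0}^{\infty} 0.99^{k} = \frac{1}{1 - 0.99} = 100,
\qquad
w(5) = 5 \cdot 100 = 500 ,
\]

which matches the cost entry in equation (2.7).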

At this point we have defined a macro-action for each state which is guaranteed to
reach either a higher-numbered state or the goal. We can immediately determine that
v*(5) = 500, since its macro-action always reaches the goal directly. Knowing the value
of state 5 lets us determine v*(4), and so forth: each call to backsubstitute tells us the
value of at least one additional state.
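The whole worked example can be reproduced with a short Python sketch of row elimination and backsubstitution in the Θ representation. Exact rational arithmetic from the standard `fractions` module is used so the results are not blurred by rounding. The function names, the zero-based state indexing, and the plain loop used for backsubstitution are assumptions of the sketch; it does no pivoting, as in the text.

```python
from fractions import Fraction


def row_eliminate(theta, w, p_goal, x):
    """Turn row x into a macro-action reaching only later states or the goal."""
    for y in range(x):
        t = theta[x][y]
        if t == 0:
            continue
        w[x] += t * w[y]
        theta[x] = [a + t * b for a, b in zip(theta[x], theta[y])]
        p_goal[x] += t * p_goal[y]
        theta[x][y] = Fraction(0)
    scale = 1 - theta[x][x]          # probability of leaving the self-loop
    w[x] /= scale
    theta[x] = [a / scale for a in theta[x]]
    theta[x][x] = Fraction(0)
    p_goal[x] /= scale


def evaluate_chain(P, c):
    """Solve (I - P) v = c for a proper policy's Markov chain."""
    n = len(c)
    theta = [[Fraction(a) for a in row] for row in P]
    w = [Fraction(a) for a in c]
    p_goal = [1 - sum(row) for row in theta]   # one-step goal probabilities
    for x in range(n):
        row_eliminate(theta, w, p_goal, x)
    v = [Fraction(0)] * n
    for x in reversed(range(n)):               # backsubstitution
        v[x] = w[x] + sum(theta[x][y] * v[y] for y in range(x + 1, n))
    return v, w, p_goal


# The five-state chain from the example (index 0 is state 1, index 4 is state 5).
p = Fraction(99, 100)
P = [
    [0, 0, 0, 0, p],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
]
c = [1, 1, 1, 1, 1]
v, w, p_goal = evaluate_chain(P, c)
```

After elimination the cost vector is (1, 2, 3, 4, 500), as in the matrices above, and backsubstitution yields the values 496 through 500 for states 1 through 5.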

Note that there are several possible ways to arrange the elimination computations in
Gaussian elimination. Our example shows row Gaussian elimination,⁶ in which we
eliminate the first k − 1 elements of row k by using rows 1 through k − 1; the advantage of
using this ordering for GDE is that we need not fix an action for state x until we pop it
from the priority queue and eliminate its row.


Gauss-Dijkstra Elimination Gauss-Dijkstra elimination combines the above Gaussian
elimination process with a Dijkstra-style priority queue that determines the order in which
states are selected for elimination. The main loop is the same as the one for PPI, except that
the policy evaluation call is removed and sweep() is replaced by GaussDijkstraSweep().
Pseudo-code for GaussDijkstraSweep() is given in Figure 2.6.

When x is popped from the queue, its action is fixed to a greedy action. The outcome
distribution for this action is used to initialize Θ_x· and w(x), and row elimination
transforms Θ_x· and w(x) into a macro-action as described above. If Θ_x,goal = 1, then we
fully know the state's value; this will always happen for the |S|th state, but may also
happen earlier. We do immediate backsubstitution when this occurs, which eliminates some
non-zeros above the diagonal and possibly causes other states' values to become known.
Immediate backsubstitution ensures that v(x) and p_goal(x) are updated with the latest
information, improving

⁶This sequence is called the Doolittle ordering when used to compute an LU factorization.
main():
    (∀x) v(x) ← M
    v(goal) ← 0
    while (true) do
        (∀x) π(x) ← undefined
        GaussDijkstraSweep()
        if ((max L1 Bellman error) < tolerance) then
            declare convergence
        end if
    end while

GaussDijkstraSweep():
    while (not queue.empty()) do
        x ← queue.pop()
        π(x) ← argmin_a Q(x, a)
        (∀y) Θ_xy ← P_xy^{π(x)}
        w(x) ← c(x, π(x))
        rowEliminate(x)
        v(x) ← (Θ_x·) · v + w(x)
        F = {x}
        if (Θ_x,goal = 1) then
            backsubstitute(x)
        end if
        (∀y ∈ F) update(y)
    end while

rowEliminate(x):
    for (y from 1 to x−1) do
        w(x) ← w(x) + Θ_xy w(y)
        Θ_x· ← Θ_x· + Θ_xy Θ_y·                    (1)
        p_goal(x) ← p_goal(x) + Θ_xy p_goal(y)
        Θ_xy ← 0
    end for
    w(x) ← w(x)/(1 − Θ_xx)
    Θ_x· ← Θ_x·/(1 − Θ_xx)                         (2)
    Θ_xx ← 0
    p_goal(x) ← p_goal(x)/(1 − Θ_xx)

backsubstitute(x):
    for each y such that Θ_yx > 0 do
        p_goal(y) ← p_goal(y) + Θ_yx
        w(y) ← w(y) + Θ_yx v(x)
        if (p_goal(y) = 1) then
            backsubstitute(y)
            F ← F ∪ {y}
        end if
    end for


Figure 2.6: Gauss-Dijkstra Elimination. The update(y) method is the same one used for
PPI, but with the pri2 priority function.


our priority estimates for states on the queue and possibly saving us work later (for
example, in the case when our transition matrix is block lower triangular, we automatically
discover that we only need to factor the blocks on the diagonal). Finally, all
predecessors of the state popped and any states whose values became known are updated using
the update() routine for PPI (in Figure 2.5). However, for GDE we use the pri2 priority
function.
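
The row-elimination step can be sketched in isolation for a fixed policy. This is an illustrative dense-matrix version under stated assumptions, not the thesis implementation: states are eliminated in plain index order rather than in priority-queue order, the goal is kept outside the matrix as a separate `p_goal` vector, and the names `row_eliminate` and `evaluate_policy` are hypothetical. The divisor 1 − Θ_xx is captured once, before the self-loop entry is zeroed.

```python
def row_eliminate(theta, p_goal, w, x):
    """Turn row x into a macro-action that reaches only later states or the goal."""
    n = len(theta)
    for y in range(x):                      # rows 0..x-1 are already eliminated
        t = theta[x][y]
        if t == 0.0:
            continue
        w[x] += t * w[y]                    # expected cost of running macro-action y
        p_goal[x] += t * p_goal[y]
        for z in range(n):
            theta[x][z] += t * theta[y][z]  # theta[y][z] is 0 for every z <= y
        theta[x][y] = 0.0
    d = 1.0 - theta[x][x]                   # saved before the self-loop is removed
    w[x] /= d
    p_goal[x] /= d
    theta[x] = [p / d for p in theta[x]]
    theta[x][x] = 0.0


def evaluate_policy(P, goal_prob, cost):
    """Eliminate every row, then back-substitute from the last state."""
    theta = [row[:] for row in P]
    p_goal, w = goal_prob[:], cost[:]
    n = len(P)
    for x in range(n):
        row_eliminate(theta, p_goal, w, x)
    v = [0.0] * n
    for x in reversed(range(n)):
        v[x] = w[x] + sum(theta[x][y] * v[y] for y in range(x + 1, n))
    return v


# Toy 3-state chain (made up for illustration); state 2 has a 0.99 self-loop.
P = [[0.5, 0.5, 0.0],
     [0.3, 0.0, 0.7],
     [0.0, 0.0, 0.99]]
goal_prob = [0.0, 0.0, 0.01]
cost = [1.0, 1.0, 5.0]
v = evaluate_policy(P, goal_prob, cost)

# The result should satisfy the policy-evaluation equations of the original MDP.
residual = max(abs(v[x] - (cost[x] + sum(P[x][y] * v[y] for y in range(3))))
               for x in range(3))
```

A sparse representation and the priority-driven ordering are what distinguish GDE proper from this sketch; the arithmetic per row is the same.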

Since |S| can be large, Θ will usually need to be represented sparsely. Assuming Θ is
stored sparsely, GDE reduces to Dijkstra's algorithm in the deterministic case; it is easy to
verify the additional matrix updates require only O(|S|) work. In a general MDP, initially it
takes no more memory to represent Θ than it does to store the dynamics of the MDP, but
the  elimination  steps  can  introduce  many  additional  non-zeros.  The  number  of  such  new 
non-zeros  is  greatly  affected  by  the  order  in  which  the  eliminations  are  performed.  There 
is  a  vast  literature  on  techniques  for  finding  such  orderings;  Duff  et  al.  [1986]  provides 
a  good  introduction.  One  of  the  main  advantages  of  GDE  seems  to  be  that  for  practical 
problems,  the  prioritization  criteria  we  present  produce  good  elimination  orders  as  well  as 
effective  policy  improvement. 
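
How strongly the elimination order affects fill-in can be seen on a standard textbook example. The sketch below is a simplification and an assumption on our part: it treats the sparsity pattern as a symmetric graph (the Θ matrix in GDE is generally not symmetric) and counts the new non-zeros created when a "star" pattern is eliminated hub-first versus hub-last. The function name `count_fill_in` is hypothetical.

```python
def count_fill_in(adjacency, order):
    """Count fill edges created by eliminating vertices in the given order.

    Eliminating a vertex connects all of its remaining neighbours to one
    another; every edge that did not exist before counts as one unit of fill.
    """
    adj = {v: set(nbrs) for v, nbrs in adjacency.items()}
    fill = 0
    for v in order:
        nbrs = list(adj.pop(v))
        for u in nbrs:
            adj[u].discard(v)
        for i, a in enumerate(nbrs):
            for b in nbrs[i + 1:]:
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill += 1
    return fill


# Star pattern: hub 0 is coupled to six leaves, leaves are not coupled to each other.
n_leaves = 6
star = {0: set(range(1, n_leaves + 1))}
for leaf in range(1, n_leaves + 1):
    star[leaf] = {0}

hub_first = count_fill_in(star, [0] + list(range(1, n_leaves + 1)))
hub_last = count_fill_in(star, list(range(1, n_leaves + 1)) + [0])
```

Eliminating the hub first couples every pair of leaves (15 new entries for six leaves), while eliminating it last creates none.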

Our primary interest in GDE stems from the wide range of possibilities for enhancing
its performance; even in the naive form outlined it is usually competitive with PPI.
We anticipate that doing "early" backsubstitution when states' values are mostly known
(high p_goal(x)) will produce even better policies and hence fewer iterations. Further, the
interpretation of rows of Θ as macro-actions suggests that caching these actions may yield
dramatic speed-ups when evaluating the MDP with a different goal state. The usefulness
of macro-actions for this purpose was demonstrated by Dean and Lin [1995]. A
convergence-checking mechanism such as those used by LRTDP and HDP [Bonet and
Geffner, 2003a,b] could also be used between iterations to avoid repeating work on
portions of the state space where an optimal policy and value function are already known. The
key to making GDE widely applicable, however, probably lies in appropriate thresholding
of values in Θ, so that transition probabilities near zero are thrown out when their
contribution to the Bellman error is negligible. Our current implementation does not do this, so
while its performance is good on many problems, it can perform poorly on problems that
generate lots of fill-in.


2.3.4  Incremental  Expansions 

In describing IPS, PPI, and GDE we have touched on a number of methods of updating
v and Q values. In summary: Value iteration repeatedly backs up states in an
arbitrary order. Prioritized sweeping backs up states in an order determined by a priority
queue. PPI and GDE also pop states from a priority queue, but rather than backing up the
popped state, they back up all of its predecessors. IPS pops states from a priority queue,
but instead of fully backing up the predecessors of the popped state x, it only recomputes
Q values for actions that might reach x.

Here we provide a more thorough accounting of the expansion mechanism used by
IPS. Suppose we are given an initial upper bound v_old on v*. Then, we can define Q by

$$Q(x, a) = c(x, a) + \sum_{y} P^{a}_{xy}\, v_{\mathrm{old}}(y)$$

and then v_new by v_new(x) = min_a Q(x, a). Note that rather than storing v_new we can simply
store Q and π(x), the greedy policy with respect to v_old. Our goal in an expansion operation
is to set v_old(x) ← v_new(x), then update Q so it reflects this change, and then update
v_new so that again v_new(x) = min_a Q(x, a). Perhaps the easiest way to ensure this property
is via a full expansion of the state x:

    v_old(x) ← Q(x, π(x))
    for all (y, b) ∈ pred(x) do
        Q(y, b) ← c(y, b) + Σ_{x' ∈ succ(y,b)} P^b_{yx'} v_old(x')
        if (Q(y, b) < Q(y, π(y))) then
            π(y) ← b
        end if
    end for

Doing such a full expansion requires O(B) work per predecessor state-action pair. We can
accomplish the same task with O(1) work if we assume without loss of generality⁷ (∀x, a)
P^a_xx = 0, and perform an incremental expansion:

    Δ(x) ← Q(x, π(x)) − v_old(x)
    for all (y, b) ∈ pred(x) do
        Q(y, b) ← Q(y, b) + P^b_{yx} Δ(x)
        if (Q(y, b) < Q(y, π(y))) then
            π(y) ← b
        end if
    end for
    v_old(x) ← Q(x, π(x))
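
That the two expansions produce the same Q values can be checked on a toy fragment. The sketch below is illustrative only: the states, actions, costs, and probabilities are made up, and it assumes no self-loops at x, as in the text.

```python
def q_value(cost, probs, v):
    """One Q-value: immediate cost plus expected value of the successors."""
    return cost + sum(p * v[s] for s, p in probs.items())


# Hypothetical fragment: two actions from state y, both of which can reach x.
cost = {("y", "b1"): 1.0, ("y", "b2"): 2.0}
P = {("y", "b1"): {"x": 0.5, "z": 0.5},
     ("y", "b2"): {"x": 0.9, "z": 0.1}}
v_old = {"x": 10.0, "z": 4.0}
Q = {sa: q_value(cost[sa], P[sa], v_old) for sa in P}

new_vx = 6.0                      # assumed value of Q(x, pi(x)) for the popped state
delta = new_vx - v_old["x"]

# Incremental expansion: constant work per predecessor (state, action) pair.
Q_inc = {sa: Q[sa] + P[sa].get("x", 0.0) * delta for sa in Q}

# Full expansion: recompute each Q value from all of its successors.
v_old["x"] = new_vx
Q_full = {sa: q_value(cost[sa], P[sa], v_old) for sa in P}

greedy_inc = min(Q_inc, key=Q_inc.get)
greedy_full = min(Q_full, key=Q_full.get)
```

The incremental form touches only the entry for x in each predecessor's outcome distribution, which is where the O(1) versus O(B) difference comes from.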

However, when doing a full expansion, we have a better option for calculating Q(y, b) than
the one given above. We can update Q(y, b) using Q(x', π(x')) in place of v_old(x');

⁷Suppose P^a_xx > 0. There exists an optimal stationary policy, so if a is selected and a self-loop occurs,
it is safe to assume that action a is selected again, until a new state is reached. In expectation this will take
1/(1 − P^a_xx) trials, so in a pre-processing step we replace a with action a', which is equivalent to taking a
until a new state is reached; we have c(x, a') = c(x, a)/(1 − P^a_xx), with transition probabilities given by
setting P^a_xx = 0, and normalizing P^a_xy for all y ≠ x.
update(x):
    for all (y, b) ∈ pred(x) do
        Q_temp ← c(y, b) + Σ_{x' ∈ succ(y,b)} P^b_{yx'} v(x')
        if (Q_temp < v(y)) then
            pri ← (Q_temp − v(y))/(Q_temp + 1)
            π(y) ← b
            if (|v(y) − Q_temp| > ε) then
                queue.decreasepriority(y, pri)
            end if
            v(y) ← Q_temp
        end if
    end for


Figure 2.7: The update function for the Improved Prioritized Sweeping algorithm,
implemented with a single value-function array v and a temporary variable Q_temp.


this may offer a tighter upper bound because Q(x', π(x')) ≤ v(x') when we pessimistically
initialize. In our experiments, this method proved superior to doing incremental
expansions, and it is the method used by Improved Prioritized Sweeping (see Figure 2.4
for the code). However, on certain problems incremental expansions may give superior
performance. IPS based on incremental expansions tends to do more updates (at lower
cost) and so priority queue operations account for a larger fraction of its running time.
Thus,  fast  approximate  priority  queues  might  offer  a  significant  advantage  to  incremental 
IPS  implementations. 

One  final  implementation  note.  Our  pseudocode  for  IPS  and  PPI  indicates  that  Q 
values  for  all  actions  are  stored.  While  this  is  necessary  if  incremental  expansions  are 
performed,  we  do  full  expansions  so  the  extra  storage  is  not  required.  It  is  sufficient  to  store 
a single value for each state, which takes the place of v_old and v in the pseudocode; newly
calculated Q(y, b) values can be replaced by a temporary variable Q_temp; the value is only
relevant if it causes v(y) to change, in which case we immediately assign v(y) the value of
the temporary for Q(y, b) rather than waiting until y is popped from the queue. Figure (2.7)
shows  this  modification  to  the  original  IPS  update  method  given  in  Figure  (2.4). 
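
The single-array variant can be sketched as follows. This is a rough illustration under several assumptions rather than the thesis code: the dictionaries `pred`, `cost`, and `P` are hypothetical containers for the MDP, the `priority` argument is a placeholder supplied by the caller (it is not the priority function used by IPS), and because Python's `heapq` has no decrease-key operation the sketch simply pushes a fresh entry instead.

```python
import heapq


def update(x, pred, cost, P, v, pi, queue, priority, eps=1e-9):
    """Re-evaluate every (state, action) pair that can reach x.

    Only one value per state is stored; each Q value lives in the temporary
    q_temp and is written into v[y] immediately if it improves on v[y].
    """
    for (y, b) in pred[x]:
        q_temp = cost[(y, b)] + sum(p * v[s] for s, p in P[(y, b)].items())
        if q_temp < v[y]:
            pi[y] = b
            if abs(v[y] - q_temp) > eps:
                # Stand-in for queue.decreasepriority(y, pri).
                heapq.heappush(queue, (priority(v[y], q_temp), y))
            v[y] = q_temp


# Tiny example: state y reaches the goal g in one step at cost 3.
M = 1e6                                   # pessimistic initial value
v = {"g": 0.0, "y": M}
pi = {}
pred = {"g": [("y", "b")]}
cost = {("y", "b"): 3.0}
P = {("y", "b"): {"g": 1.0}}
queue = []
update("g", pred, cost, P, v, pi, queue, priority=lambda old, new: new)
```

With lazy re-insertion, stale queue entries have to be skipped when popped; a heap that supports decrease-key avoids that bookkeeping.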
2.3.5 Experimental Results

We  implemented  IPS,  PPI,  and  GDE  and  compared  them  to  VI,  Prioritized  Sweeping, 
and LRTDP. All algorithms were implemented in Java 1.5.0 and tested on a 3 GHz Intel
machine with 2 GB of main memory under Linux.

Our PPI implementation uses a stabilized biconjugate gradient solver with an incomplete
LU preconditioner as implemented in the Matrix Toolkit for Java [Heimsund, 2004].
No  native  or  optimized  code  was  used;  using  architecture-tuned  implementations  of  the 
underlying  linear  algebraic  routines  could  give  a  significant  speedup. 

For LRTDP we specified a few reasonable start states for each problem. Typically
LRTDP converged after labeling only a small fraction of the state space as solved, up
to about 25% on some problems.


Experimental  Domain 

We describe experiments in a discrete 4-dimensional planning problem that captures many
important issues in mobile robot path planning. Our domain generalizes the racetrack
domain described previously in [Barto et al., 1995, Bonet and Geffner, 2003b,a, Hansen
and Zilberstein, 2001]. A state in this problem is described by a 4-tuple, s = (x, y, dx, dy),
where (x, y) gives the location in a 2D occupancy map, and (dx, dy) gives the robot's
current velocity in each dimension. On each time step, the agent selects an acceleration
a = (ax, ay) ∈ {−1, 0, 1}² and hopes to transition to state (x + dx, y + dy, dx + ax, dy +
ay). However, noise and obstacles can affect the actual result state. If the line from (x, y)
to (x + dx, y + dy) in the occupancy grid crosses an occupied cell, then the robot "crashes,"
moving to the cell just prior to the obstacle and losing all velocity. (The robot does not
reset to the start state as in some racetrack models.) Additionally, the robot may be affected
by several types of noise:

•  Action  Failure  With  probability  fp,  the  requested  acceleration  fails  and  the  next 
state  is  {x  +  dx,  y  +  dy,  dx,  dy). 

•  Local Noise To model the fact that some parts of the world are more stochastic
than others, we mark certain cells in the occupancy grid as "noisy," along with a
designated direction. When the robot crosses such a cell, it has a probability fe of
experiencing an acceleration of magnitude 1 or 2 in the designated direction.

•  One-way passages Cells marked as "one-way" have a specified direction (north,
south, east, or west), and can only be crossed if the agent is moving in the indicated
direction. Any non-zero velocity in another direction results in a crash, leaving the
agent in the one-way state with zero velocity.

•  High-velocity noise If the robot's velocity surpasses an L2 threshold, it incurs a
random acceleration on each time step with probability fh. This acceleration is
chosen uniformly from {−1, 0, 1}², excluding the (0, 0) acceleration.

These additions to the domain allow us to capture a wider variety of planning problems. In
particular, kinodynamic path planning for mobile robots generally has more noise (more
possible outcomes of a given action as well as higher probability of departure from the
nominal command) than the original racetrack domain allows. Action failure and
high-velocity noise can be caused by wheels slipping, delays in the control loop, bumpy terrain,
and so on. One-way passages can be used to model low curbs or other map features that
can be passed in only one direction by a wheeled robot. And, local noise can model a
robot driving across sloped terrain: downhill accelerations are easier than uphill ones.

Table 2.1 summarizes the parameters of the test problems we used. The "% determ"
column indicates the percentage of (s, a) pairs with deterministic outcomes; our
implementation uses a deterministic transition to apply the collision cost, so all problems have
some deterministic transitions. The O column gives the average number of outcomes for
non-deterministic transitions. All problems have 9 actions except for (B), which is a
policy evaluation problem. Problem (C) has high-velocity noise, with a threshold of √2 + ε.
Figure 2.8 shows the 2D world maps for most of the problems.

       |S|      fp     fe     % determ   O      notes
  A    59,780   0.00   0.00   100.0%     1.00   deterministic
  B    96,736   0.05   0.10    17.2%     2.17   |A| = 1
  C    11,932   0.20   0.00    25.1%     4.10   fh = 0.05
  D    10,072   0.10   0.25    39.0%     2.15   cycle
  E    96,736   0.00   0.20    90.8%     2.41
  F    21,559   0.20   0.00    34.5%     2.00   large-b
  G    27,482   0.10   0.00    90.4%     3.00

Table 2.1: Test problem sizes and parameters.
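
The nominal dynamics and the action-failure noise described above can be written down compactly. The following is a minimal sketch under simplifying assumptions: it ignores obstacles, noisy cells, one-way passages, and high-velocity noise, and the function name `nominal_outcomes` is hypothetical.

```python
# The nine accelerations available on each time step.
ACTIONS = [(ax, ay) for ax in (-1, 0, 1) for ay in (-1, 0, 1)]


def nominal_outcomes(state, accel, fp):
    """Outcome distribution for one step in an obstacle-free region.

    With probability 1 - fp the requested acceleration is applied; with
    probability fp it fails and the velocity is left unchanged.
    """
    x, y, dx, dy = state
    ax, ay = accel
    intended = (x + dx, y + dy, dx + ax, dy + ay)
    failed = (x + dx, y + dy, dx, dy)
    outcomes = {}
    for s, p in ((intended, 1.0 - fp), (failed, fp)):
        if p > 0.0:
            outcomes[s] = outcomes.get(s, 0.0) + p   # merge identical outcomes
    return outcomes


turn = nominal_outcomes((2, 3, 1, 0), (1, -1), fp=0.1)
coast = nominal_outcomes((2, 3, 1, 0), (0, 0), fp=0.1)
```

Note that the zero acceleration yields a deterministic outcome even when fp > 0, because the intended and failed successors coincide.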

To construct larger problems for some of our experiments, we consider linking copies
of an MDP in series by making the goal state of the ith copy transition to the start state
of the (i + 1)st copy. We indicate k serial copies of an MDP M by M^k, so for example 22
copies of problem (G) is denoted (G^22).
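
One way to realize such a serial linkage is sketched below. This is only one possible interpretation and not necessarily how the test problems were built: transitions into the goal of copy i are redirected to the start of copy i + 1, states are relabelled as (copy index, state), and the name `link_in_series` is hypothetical.

```python
def link_in_series(transitions, start, goal, k):
    """Build k serially linked copies of an MDP.

    transitions maps (state, action) to a dict {next_state: probability}.
    Only the goal of the last copy remains a true goal.
    """
    linked = {}
    for i in range(k):
        for (s, a), outcomes in transitions.items():
            new_outcomes = {}
            for t, p in outcomes.items():
                if t == goal and i < k - 1:
                    target = (i + 1, start)      # hand off to the next copy
                else:
                    target = (i, t)
                new_outcomes[target] = new_outcomes.get(target, 0.0) + p
            linked[((i, s), a)] = new_outcomes
    return linked


# Toy single-state MDP: action "go" reaches the goal with probability 0.8.
toy = {("s", "go"): {"g": 0.8, "s": 0.2}}
chain = link_in_series(toy, start="s", goal="g", k=3)
```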
Figure 2.8: Some maps used for test experiments; maps are not drawn to the same scale.
Problem  (E)  uses  the  same  map  as  (B).  Problem  (G)  uses  a  smaller  version  of  map  (B). 
Special  states  (one-way  passages,  local  noise)  are  indicated  by  light  grey  symbols;  contact 
the  authors  for  full  map  specifications. 


Experimental  Results 

Effects  of  Local  Noise  First,  we  considered  the  effect  of  increasing  the  randomness 
fe  and  fp  for  the  fixed  map  (G),  a  smaller  version  of  (B).  One-way  passages  give  this 
complex  map  the  possibility  for  cycles.  Figure  2.9  shows  the  run  times  (y-axis)  of  several 
algorithms plotted against fp. The parameter fe was set to 0.5 fp for each trial.

These results demonstrate the catastrophic effect increased noise can have on the
performance of VI. For low-noise problems, VI converges reasonably quickly, but as noise
is increased the expected length of trajectories to the goal grows, and VI's performance
degrades accordingly. IPS performs somewhat better overall, but it suffers from this same
problem as the noise increases. However, PPI's use of policy evaluation steps quickly
propagates values through these cycles, and so its performance is almost totally unaffected
by the additional noise. PPI-4 beats VI on all trials. It wins by a factor of 2.4 with
fp = 0.05, and with fp = 0.4 PPI-4 is 29 times faster than VI.

Figure 2.9: Effect of local noise on solution time. The leftmost data point is for the
deterministic problem. Note that PPI-4 exhibits almost constant runtime even as noise is
increased.

The  dip  in  runtimes  for  LRTDP  is  probably  due  to  changes  in  the  optimal  policy,  and 
the  number  and  order  in  which  states  are  converged.  Confidence  intervals  are  given  for 
LRTDP  only,  as  it  is  a  randomized  algorithm.  The  deterministic  algorithms  were  run 
multiple  times,  and  deviations  in  runtimes  were  negligible. 


Number  of  Policy  Evaluation  Steps  Policy  iteration  is  an  attractive  algorithm  for  MDPs 
where  policy  evaluation  via  backups  or  expansions  is  likely  to  be  slow.  It  is  well  known 
that policy iteration typically converges in few iterations. However, Figure 2.10 shows that
our  algorithms  can  greatly  reduce  the  number  of  iterations  required.  In  problems  where 
policy  evaluation  is  expensive,  this  can  provide  a  significant  overall  savings  in  computation 
time. 

The number of iterations that standard policy iteration takes to converge depends on the
initial policy. We experimented with initializing to the uniform stochastic policy,⁸ random
policies that at least give all states finite value, and an optimal policy for the deterministic
relaxation of the problem.⁹ The choice of initial policy rarely changed the number of
iterations by more than 2 or 3, and in almost all cases initializing with the policy from the
deterministic relaxation gave the best performance. Policy iteration was initialized in this
way for the results in Figure 2.10.

We compare policy iteration to PPI, where we use either 1, 2, or 4 sweeps of Dijkstra
policy improvement between iterations. We also ran GDE on these problems. Typically it
required the same number of iterations as PPI, but we hope to improve upon this
performance in future work.


Q-value Computations Our implementations are optimized not for speed but for ease of
use, instrumentation, and modification. We expect our algorithms to benefit much more
from tuning than value iteration. To show this potential, we compare IPS, PS, and VI on
the number of Q-value computations (Q-comps) they perform. A single Q-comp means
iterating over all the outcomes for a given (s, a) pair to calculate the current Q value. A
backup takes |A| Q-comps, for example. We do not compare PPI-4, GDE, and LRTDP
based on this measure, as they also perform other types of computation.
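
The Q-comp measure is easy to instrument. The sketch below shows one way to count it; the class name `QCompCounter` and the toy one-state problem are hypothetical and serve only to illustrate that a single backup costs |A| Q-comps.

```python
class QCompCounter:
    """Wraps an MDP's costs and transitions and counts Q-value computations."""

    def __init__(self, cost, P):
        self.cost = cost      # dict: (state, action) -> immediate cost
        self.P = P            # dict: (state, action) -> {next_state: prob}
        self.count = 0

    def q(self, s, a, v):
        """One Q-comp: iterate over all outcomes of the pair (s, a)."""
        self.count += 1
        return self.cost[(s, a)] + sum(p * v[t] for t, p in self.P[(s, a)].items())

    def backup(self, s, actions, v):
        """A full backup evaluates every action once."""
        return min(self.q(s, a, v) for a in actions)


# Toy problem with nine actions, each reaching the goal g directly.
actions = [(ax, ay) for ax in (-1, 0, 1) for ay in (-1, 0, 1)]
cost = {("s", a): 1.0 + i for i, a in enumerate(actions)}
P = {("s", a): {"g": 1.0} for a in actions}

counter = QCompCounter(cost, P)
best = counter.backup("s", actions, {"g": 0.0})
```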

IPS typically needed substantially fewer Q-comps than VI. On the deterministic
problem (A), VI required 255 times as many Q-comps as IPS, due to IPS's reduction to
Dijkstra's algorithm; VI made 7.3 times as many Q-comps as PS. On problems (B) through (E),
VI on average needed 15.29 times as many Q-comps as IPS, and 5.16 times as many as PS.
On (G^22) it needed 36 times as many Q-comps as IPS. However, these large wins in
number of Q-comps are offset by value iteration's higher throughput: for example, on problems
(B) through (E) VI averaged 27,630 Q-comps per millisecond, while PS averaged 4,033
and IPS averaged 3,393. PS and IPS will always have somewhat more overhead per
Q-comp than VI. However, replacing the standard binary heap we implemented with a more
sophisticated algorithm or with an approximate queuing strategy could greatly reduce this
overhead, possibly leading to significantly improved performance.

⁸This is a poor initialization not only because it is an ill-advised policy, but also because it often produces
a poorly-conditioned linear system that is difficult to solve.

⁹This is the policy chosen by an agent who can choose the outcome of each action, rather than having
an outcome sampled from the problem dynamics. This policy and its value function can be computed by
any shortest path algorithm or A* if a heuristic is available. Note that this policy is different than the greedy
policy with respect to the value function of the deterministic relaxation, which need not even be a proper
policy. We will discuss this issue in greater depth in Section (2.4.1).

Figure 2.10: Number of policy evaluation steps.
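
The trade-off between fewer Q-comps and lower throughput can be made concrete with a back-of-the-envelope calculation using only the averages quoted for problems (B) through (E). This is a rough sketch, not a reported result: it combines an average of per-problem ratios with average throughputs, so the implied figures are indicative at best.

```python
# Averages quoted in the text for problems (B) through (E).
vi_over_ips_qcomps = 15.29    # VI needs this many times the Q-comps of IPS
vi_over_ps_qcomps = 5.16      # ... and this many times the Q-comps of PS
vi_rate = 27630               # Q-comps per millisecond
ps_rate = 4033
ips_rate = 3393

# Implied run-time ratio (VI time / other time) = Q-comp ratio * rate ratio.
ips_speedup = vi_over_ips_qcomps * ips_rate / vi_rate
ps_speedup = vi_over_ps_qcomps * ps_rate / vi_rate
```

Under these assumptions the implied advantage of IPS over VI shrinks from roughly 15x in Q-comps to a bit under 2x in time, and the advantage of PS disappears (the ratio falls below 1), which is consistent with the remark that VI's throughput offsets much of the savings.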

Figure 2.11 compares the number of Q-comps required to solve serially linked copies
of  problem  (D):  the  x-axis  indicates  the  number  of  copies,  from  (D^)  to  (D®).  VI  still 
has  competitive  run-times  because  it  performs  Q-comps  much  faster.  On  (D®)  it  averages 
41,360  Q-comps  per  millisecond,  while  PS  performs  only  4,453  and  IPS  only  3,871. 


Overall Performance of Solvers Figure 2.12 shows a comparison of the run-times of
our solvers on the various test problems. Problem (G^22) has 623,964 states, showing that
our approaches can scale to large problems. On (G^22), the stabilized biconjugate
gradient algorithm failed to converge on the initial linear systems produced by PPI-4, so we
instead used PPI where 28 initial sweeps were made (so that there was a reasonable policy
to be evaluated initially), and then 7 sweeps were made between subsequent evaluations.
We also found that adding a pass of standard greedy policy improvement after the sweeps
improved performance. These changes roughly balanced the time spent on sweeping and
policy improvement. In future work we hope to develop more principled and automatic
methods for determining how to split computation time between sweeps and policy
evaluation. We did not run PS, LRTDP, or GDE on this problem.

¹⁰This experiment was run on a different (though similar) machine than the other experiments, a 3.4 GHz
Pentium under Linux with 1 GB of memory.

Figure 2.11: Comparison of number of Q-computations performed by IPS, PS, and VI to
solve serially-linked copies of problem (D).

Generally, our algorithms do best on problems that are sparsely stochastic (only have
randomness at a few states) and also on domains where typical trajectories are long relative
to the size of the state space. These long trajectories cause serious difficulties for methods
that do not use an efficient form of policy evaluation. For similar reasons, our algorithms
do better on long, narrow domains rather than wide open ones; the key factor is again the
expected length of the trajectories versus the size of the state space.

Figure 2.12: Comparison of a selection of algorithms on representative problems.
Problem (A) is deterministic, and Problem (B) requires only policy evaluation. Results are
normalized to show the fraction of the longest solution time taken by each algorithm. On
problems (B) and (E), the slowest algorithms were stopped before they had converged.
LRTDP is not charged for time spent calculating its heuristic, which is negligible in all
problems except (A).

Value iteration backed up states in the order in which states were indexed in the
internal representation; this order was generated by a breadth-first search from the start state
to find all reachable states. While this ordering provides better cache performance than
a random ordering, we ran a minimal set of experiments and observed that the natural
ordering performs somewhat worse (up to 20% in our limited experiments) than random
orderings. Despite this, we observed better than expected performance for value
iteration, especially as it compares to LRTDP and Prioritized Sweeping. For example, on the
large-b problem (F), [Bonet and Geffner, 2003a] reports a slight win for LRTDP over
VI, but our experiments show VI being faster.

Also, GDE's performance is typically close to or better than that of PPI-4, except on
problem (B), where GDE fails due to moderately high fill-in. These results are
encouraging because GDE already sometimes performs better than PPI-4, and currently GDE is
based on a naive implementation of Gaussian elimination and sparse matrix code. The
literature in the numerical analysis community shows that more advanced techniques can
yield dramatic speedups (see, for example, [Gupta, 2002]).




2.3.6  Discussion 


The  success  of  Dijkstra’s  algorithm  has  inspired  many  algorithms  for  MDP  planning  to 
use  a  priority  queue  to  try  to  schedule  when  to  visit  each  state.  However,  none  of  these 
algorithms  reduce  to  Dijkstra’s  algorithm  if  the  input  happens  to  be  deterministic.  And, 
more  importantly,  they  are  not  robust  to  the  presence  of  noise  and  cycles  in  the  MDP. 
For  MDPs  with  significant  randomness  and  cycles,  no  algorithm  based  on  backups  or 
expansions  can  hope  to  remain  efficient.  Instead,  we  turn  to  algorithms  which  explicitly 
solve  systems  of  linear  equations  to  evaluate  policies  or  pieces  of  policies. 

We have introduced a family of algorithms (Improved Prioritized Sweeping, Prioritized
Policy Iteration, and Gauss-Dijkstra Elimination) which retain some of the best
features of Dijkstra's algorithm while integrating varying amounts of policy evaluation.
We  have  evaluated  these  algorithms  in  a  series  of  experiments,  comparing  them  to  other 
well-known  MDP  planning  algorithms  on  a  variety  of  MDPs.  Our  experiments  show  that 
the  new  algorithms  can  be  robust  to  noise  and  cycles,  and  that  they  are  able  to  solve  many 
types  of  problems  more  efficiently  than  previous  algorithms  could. 

For  problems  which  are  fairly  close  to  deterministic  or  with  only  moderate  noise  and 
cycles,  we  recommend  Improved  Prioritized  Sweeping.  For  problems  with  fast  mixing 
times  or  short  average  path  lengths,  value  iteration  is  hard  to  beat  and  is  probably  the 
simplest of all of the algorithms to implement. For general use, we recommend the
Prioritized Policy Iteration algorithm. It is simple to implement, and can take advantage of
fast, vendor-supplied linear algebra routines to speed policy evaluation. All of these
approaches are most appropriate when the agent may visit a large fraction of the state space,
either because the agent's start state is unknown or because reaching the goal requires
visiting much of the state space. In the next section, we consider problems where a fixed
start  state  is  known,  and  it  is  possible  (with  high  probability)  to  reach  the  goal  while  only 
visiting  a  small  fraction  of  the  state  space. 


2.4  Bounded  Real-Time  Dynamic  Programming 

In  this  section  we  consider  the  problem  of  finding  a  policy  in  a  Markov  decision  process 
with  a  fixed  start  state  s,  a  fixed  zero-cost  absorbing  goal  state  g,  and  non-negative  costs. 
An  arbitrary  distribution  over  initial  states  can  be  modeled  by  adding  an  imaginary  start 
state  with  a  single  action  that  produces  the  desired  distribution.  Perhaps  the  simplest 
algorithm  for  this  problem  is  value  iteration,  which  solves  for  an  optimal  policy  on  the  full 
state space. Many realistic problems, however, are too large for such an approach and often
only a small fraction of the state space is relevant to the problem of reaching g from s.
This fact has inspired the development of a number of algorithms that focus computation
on states that seem to be most relevant to finding an optimal policy from s. Such algorithms
include Real-Time Dynamic Programming (RTDP) [Barto et al., 1995], Labeled-RTDP
(LRTDP) [Bonet and Geffner, 2003b], LAO* [Hansen and Zilberstein, 2001], Heuristic
Search/DP (HDP) [Bonet and Geffner, 2003a], Envelope Propagation (EP) [Dean et al.,
1995], and Focused Dynamic Programming (FP) [Ferguson and Stentz, 2004].

Many  of  these  algorithms  use  heuristics  (lower  bounds  on  the  optimal  value  function) 
and/or  sampled  greedy  trajectories  to  focus  computation.  In  this  section,  we  introduce 
Bounded  RTDP  (BRTDP),  which  is  based  on  RTDP  and  uses  both  a  lower  bound  and 
sampled  trajectories.  Unlike  RTDP,  however,  it  also  maintains  an  upper  bound  on  the 
optimal  value  function,  which  allows  it  to  focus  on  states  that  are  both  relevant  (frequently 
reached  under  the  current  policy)  and  poorly  understood  (large  gap  between  upper  and 
lower bound). Further, acting greedily with respect to an appropriate upper bound allows
BRTDP  to  make  anytime  performance  guarantees. 

Finding an appropriate upper bound to initialize BRTDP can greatly impact its
performance. One of the contributions of this work is an efficient algorithm for finding such an
upper  bound.  Nevertheless,  our  experiments  show  that  BRTDP  performs  well  even  when 
initialized  naively. 

We  evaluate  BRTDP  on  two  criteria:  off-line  convergence,  the  time  required  to  find  an 
approximately  optimal  partial  policy  before  any  actions  are  taken  in  the  real  world;  and 
anytime  performance,  the  ability  to  produce  a  reasonable  partial  policy  at  any  time  after 
computation  is  started. 

Our experiments show that when run off-line, BRTDP often converges much more
quickly than LRTDP and HDP, which are known to have good off-line convergence properties.
In fact, the gap in off-line performance between BRTDP and competing algorithms can
be arbitrarily large because of differences in how they check convergence. HDP, LRTDP,
and LAO* (and most other algorithms of which we are aware) have convergence guarantees
based on achieving a small Bellman residual on all states reachable under the current
policy, while BRTDP only requires a small residual on states reachable with significant
probability. Let $f_\pi(y)$ be the expected number of visits to state $y$ given that the agent starts
at $s$ and executes policy $\pi$. We say an MDP has dense noise if all policies have many
nonzero entries in $f_\pi$. For example, planning problems with action errors have $f_\pi(y) > 0$
for all reachable states. (Action errors mean that, with some small probability, we take a
random action rather than the desired one.) Dense noise is fairly common, particularly in
domains from robotics. For example, Gaussian errors in movement will make every state
have positive probability of being visited. Gaussian motion-error models are widespread,
e.g. Ng et al. [2004]. Unpredictable motion of another agent can also cause large numbers
of states to have positive visitation probability; an example of this sort of model is
described by Roy et al. [2004].

Footnote (on other algorithms of which we are aware): After the initial submission of this
work, it was pointed out that our exploration strategy is similar to that of the HSVI algorithm
[Smith and Simmons, 2004]; since HSVI is designed for POMDPs rather than MDPs, the forms
of the bounds that it maintains are different from ours, and its backup operations are much
more expensive.

For HDP or LRTDP to converge on problems with dense noise, they must do work that
is at least linear in the number of nonzero entries in $f_\pi$, even if most of those entries are
almost  zero.  BRTDP’s  bounds  allow  it  to  make  performance  guarantees  on  MDPs  with 
dense  noise  without  doing  work  linear  in  the  number  of  states  reachable  under  its  greedy 
policy,  potentially  making  it  arbitrarily  faster  than  HDP,  LRTDP,  and  LAO*. 

When used as an anytime algorithm, a suitably initialized BRTDP consistently outperforms
a similarly initialized RTDP (which is known to have good anytime properties).
Without any initialization information, BRTDP is competitive with RTDP and sometimes
better. Furthermore, given reasonable initialization assumptions, BRTDP will always return
a policy with provable performance bounds. We know of no other MDP algorithms
with this property.

In the next section we establish notation and define concepts needed to describe our
algorithm. We then propose an algorithm for finding a monotone upper bound in time linear
in the size of the state space. Section 2.4.3 explains BRTDP in detail and Section 2.4.4
describes different initialization scenarios and associated guarantees. In Section 2.4.5 we
formalize our notions of off-line convergence and anytime performance, and demonstrate
that BRTDP can outperform existing algorithms on both of these tasks.


2.4.1  Basic  Results 

We will again work with value functions. Recall that the Bellman error for a value function
$v$ at a state $x$ is given by $be_v(x) = v(x) - \min_{a \in A} Q_v(x, a)$. We are particularly
interested in monotone value functions: $v$ is monotone optimistic (a monotone lower bound) if
$\forall x,\ be_v(x) \le 0$ and monotone pessimistic (a monotone upper bound) if $\forall x,\ be_v(x) \ge 0$.
We use the following two theorems, which can be proved using techniques from [Bertsekas
and Tsitsiklis, 1996, Sec. 2.2].
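The definitions above can be checked mechanically on a small example. The following is a minimal sketch, not code from this work: the dictionary layout (`P[x][a]` mapping successors to probabilities, `c[x][a]` giving costs) and the one-state test problem are assumptions made only for illustration.

```python
def q_value(P, c, v, x, a):
    """Q_v(x, a) = c(x, a) + sum_y P^a_{xy} v(y)."""
    return c[x][a] + sum(p * v[y] for y, p in P[x][a].items())


def bellman_error(P, c, v, x):
    """be_v(x) = v(x) - min_a Q_v(x, a)."""
    return v[x] - min(q_value(P, c, v, x, a) for a in P[x])


def is_monotone_pessimistic(P, c, v, tol=1e-12):
    """True if be_v(x) >= 0 at every non-goal state (monotone upper bound)."""
    return all(bellman_error(P, c, v, x) >= -tol for x in P)


def is_monotone_optimistic(P, c, v, tol=1e-12):
    """True if be_v(x) <= 0 at every non-goal state (monotone lower bound)."""
    return all(bellman_error(P, c, v, x) <= tol for x in P)


# Hypothetical problem: from x, action "go" costs 1 and reaches the goal g
# with probability 0.5 (otherwise it stays at x), so v*(x) = 2.
P = {"x": {"go": {"g": 0.5, "x": 0.5}}}
c = {"x": {"go": 1.0}}
v_star = {"x": 2.0, "g": 0.0}
v_high = {"x": 5.0, "g": 0.0}   # be = 5 - 3.5 = 1.5, so pessimistic
v_low = {"x": 0.0, "g": 0.0}    # be = 0 - 1 = -1, so optimistic
```

In this example the optimal value function has zero Bellman error, so it passes both checks, while `v_high` and `v_low` pass exactly one each.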

Theorem 2.4.1. If $v$ is monotone pessimistic, then $v$ is an upper bound on $v^*$. Similarly, if
$v$ is monotone optimistic, then $v$ is a lower bound on $v^*$.

Figure 2.13: An MDP where the greedy policy with respect to $v_d$, the values from the
deterministic relaxation, is improper. Costs are $c(x, a) = 1$ and $c(x, b) = 10$.

Theorem 2.4.2. Suppose $v_u$ is a monotone upper bound on $v^*$. If $\pi$ is the greedy policy
with respect to $v_u$, then for all $x$, $v_\pi(x) \le v_u(x)$.

No analog to Theorem 2.4.2 exists for a greedy policy based on a lower bound $v_\ell$,
monotone or otherwise: such a policy may be arbitrarily bad. Consider an arbitrary
stochastic shortest path problem $\mathcal{M} = (S, A, P, c, s, g)$, and consider the values found
by solving $\mathcal{M}_d$, the deterministic relaxation of $\mathcal{M}$. That is, in $\mathcal{M}_d$ the planner can choose
any outcome of any action from each state, rather than choosing an action and then facing
a stochastic outcome. It is easy to show that the optimal values $v_d$ for $\mathcal{M}_d$ are a monotone
lower bound on $v^*$. Further, $\mathcal{M}_d$ is deterministic, so it can be solved via A* or
Dijkstra's algorithm. However, greedy($v_d$) need not even be proper. Consider the MDP
shown in Figure 2.13, and suppose $c(x, a) = 1$ and $c(x, b) = 10$. Then, $v_d(x) = 10$,
and so $Q_{v_d}(x, a) = 11$ and $Q_{v_d}(x, b) = 19$, and the greedy policy with respect to $v_d$ thus
always selects action $a$. This observation is important because RTDP, LRTDP, HDP, and
LAO* are often initialized to $v_d$, and they select actions greedily with respect to their value
functions. Thus, initially these algorithms may produce arbitrarily bad stationary policies.
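The arithmetic of this example can be replayed in a few lines. The sketch below is illustrative only: the transition probabilities are not stated in the text, so the values used here (action $a$ loops at $x$ with probability 1, action $b$ reaches $g$ with probability 0.1) are assumptions chosen to be consistent with the quoted values $Q_{v_d}(x,a) = 11$ and $Q_{v_d}(x,b) = 19$.

```python
# Assumed transition model for the single non-goal state x of Figure 2.13.
P = {"x": {"a": {"x": 1.0}, "b": {"x": 0.9, "g": 0.1}}}
c = {"x": {"a": 1.0, "b": 10.0}}


def deterministic_relaxation(P, c, goal):
    """Values v_d when the planner may pick any positive-probability outcome.

    Uses simple Bellman-Ford style sweeps, which is enough for tiny examples.
    """
    v = {x: float("inf") for x in P}
    v[goal] = 0.0
    for _ in range(len(P) + 1):
        for x in P:
            v[x] = min(c[x][a] + v[y]
                       for a in P[x] for y, p in P[x][a].items() if p > 0.0)
    return v


def q_value(P, c, v, x, a):
    """Q_v(x, a) = c(x, a) + sum_y P^a_{xy} v(y)."""
    return c[x][a] + sum(p * v[y] for y, p in P[x][a].items())


v_d = deterministic_relaxation(P, c, "g")
q = {a: q_value(P, c, v_d, "x", a) for a in P["x"]}
greedy_action = min(q, key=q.get)
# The greedy action never reaches the goal, so the greedy policy is improper.
greedy_is_proper = any(y == "g" and p > 0.0
                       for y, p in P["x"][greedy_action].items())
```

Under these assumed probabilities the greedy choice is the cheap self-loop, which is exactly the failure mode described above.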

A proper policy, however, can always be extracted from $v_d$. In order for $x$ to get a
value $v_d(x)$ there must exist some $y \in S$ and $a \in A$ satisfying $P^a_{xy} > 0$ and $v_d(x) =
c(x, a) + v_d(y)$. Then, it is not hard to show that we can construct a proper policy by
setting $\pi_d(x) = a$ for any such $a$; if there are multiple such actions, it is natural to pick the
one with highest $P^a_{xy}$.

In summary, monotone lower bounds can have arbitrarily bad greedy policies, but
greedy policies for monotone upper bounds do at least as well as the bound. Thus, we
believe (and our experimental results demonstrate) that there is significant advantage to
having an anytime algorithm that returns a policy that is greedy with respect to a monotone
upper bound on the value function. The intuition is that if we have not finished planning
and must return some non-optimal plan to be executed, it is wise to be pessimistic about
regions of the state space where we haven't done much work. However, during the planning
phase (that is, in simulation), being optimistic about relatively unexplored regions is
beneficial. Thus, BRTDP gains an advantage by maintaining both an upper and lower bound on
the value function. The question of how to efficiently compute an initial monotone upper
bound is addressed in the next section.

2.4.2  Monotonic  Upper  Bounds  in  Linear  Time 

Our planning algorithm, BRTDP, is described below in Section 2.4.3. It can be initialized
with any upper and lower bounds $v_u$ and $v_\ell$ on $v^*$, and provides performance guarantees if
$v_u$ and $v_\ell$ are monotone. So, we need to compute monotone bounds $v_u$ and $v_\ell$ efficiently.
This section describes how to do so assuming we can afford to visit every state a small
number of times; Section 2.4.4 describes looser bounds which don't require visiting all of
$S$. As noted above, we can initialize $v_\ell$ to the value of the deterministic relaxation $\mathcal{M}_d$;
so, the remainder of this section deals with $v_u$.

For any proper policy $\pi$, the value function $v_\pi$ is a monotone upper bound. A proper
policy can be found reasonably quickly, for example by computing $\pi_d$ from the deterministic
relaxation. Unfortunately, directly solving the linear system to evaluate $\pi$ requires
about $O(|S|^3)$ time in the worst case. This is the fastest technique currently in the literature
of which we are aware. We introduce a new algorithm, which we call Dijkstra Sweep
for Monotone Pessimistic Initialization (DS-MPI), which can compute a monotone upper
bound in $O(|S| \log |S|)$ time.

Suppose we are given a policy $\pi$ along with $p_{\mathrm{goal}}, w \in \mathbb{R}_+^{|S|}$ that satisfy the following
property: if we execute $\pi$ from $x$ until some fixed but arbitrary stopping condition is met, then
$w(x)$ is an upper bound on the expected cost of the execution from $x$ until the stopping
condition is met, and $p_{\mathrm{goal}}(x)$ is a lower bound on the probability that the current state is the
goal when execution is stopped. If $p_{\mathrm{goal}}(x) > 0$ and $w(x)$ is finite for all $x$, we can use this
information to derive an upper bound on $v^*$. We first informally motivate this derivation;
then, we present a theorem that shows that with some additional conditions on $w$ and $p_{\mathrm{goal}}$
we can derive a monotone upper bound. Finally, we give an efficient algorithm for finding
the necessary $w$ and $p_{\mathrm{goal}}$.

Imagine executing $\pi$, starting from some state $x$, up until the stopping condition is
met. This costs at most $w(x)$ in expectation, and with probability at least $p_{\mathrm{goal}}(x)$ we
reach the goal. But, suppose we don't reach the goal, and instead arrive at some other
state $y$. We have made no assumptions about $\pi$, and so $y$ might be the “worst” state in the
MDP. Nevertheless, we can re-start our execution of $\pi$ from $y$, paying an additional $w(y)$
in expectation, and reaching the goal with probability at least $p_{\mathrm{goal}}(y)$. We can continue
repeating this process, and because $(\forall x)\ p_{\mathrm{goal}}(x) > 0$, we will eventually reach the goal.

Footnote (on the $O(|S|^3)$ cost of policy evaluation): For some particular problems, sparse
linear solvers or iterative methods may offer better performance.

Footnote (on the stopping condition): For example, we might execute $\pi$ for $t$ steps; or
execute $\pi$ until we reach a state in some subset of $S$. Formally, $\pi$ can be an arbitrary
(history-dependent) policy, and the stopping condition can be an arbitrary function $\theta$.
If $H$ is the set of all possible histories (trajectories), then $\theta : H \to \{0, 1\}$, where
$\theta(h) = 1$ implies stopping; $\theta$ need only ensure that every trajectory stops after a
finite number of steps.

We can upper bound the expected cost in this process by explicitly considering an
adversary that after each execution of $\pi$ (up to the stopping condition) gets to teleport the
planning agent to an arbitrary state $y$ with probability $1 - p_{\mathrm{goal}}(x)$; with probability $p_{\mathrm{goal}}(x)$,
the process ends. We model this interaction as an MDP for the adversary: there is a single
non-goal state, and one action for each state in the original problem, corresponding to the
destination of the teleportation. We can solve this single-state MDP by computing the
optimal value

$$\lambda_1 = \max_{x \in S} \frac{w(x)}{p_{\mathrm{goal}}(x)}.$$

It follows that in the original MDP $\mathcal{M}$, for all $x$, $v^*(x) \le \lambda_1$. For any particular $x$, we
also have $v^*(x) \le w(x) + (1 - p_{\mathrm{goal}}(x))\lambda_1$: the value of a state can't be any worse than
following $\pi$ until the stopping condition is met (and paying $w(x)$), and then ending up at
the worst state in the MDP with probability $1 - p_{\mathrm{goal}}(x)$, which has value no greater than
$\lambda_1$. Further, if we are given some $Z \in \mathbb{R}$ such that $v^*(x) \le Z$ for all $x$, then we can use
$Z$ in place of $\lambda_1$ and still have an upper bound by the same argument. However, without
further assumptions, $v_u$ (based on $\lambda_1$ or $Z$) need not be monotonic.
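As a concrete reading of the informal argument, the bound can be computed directly from $w$ and $p_{\mathrm{goal}}$. This is a minimal sketch under assumed inputs: the function name, the dictionary representation, and the two-state numbers are hypothetical and serve only to illustrate the formula.

```python
def informal_upper_bound(w, p_goal):
    """Return (lambda_1, bound) with bound[x] = w(x) + (1 - p_goal(x)) * lambda_1.

    Assumes w(x) is finite and p_goal(x) > 0 for every state x.
    """
    lam1 = max(w[x] / p_goal[x] for x in w)
    bound = {x: w[x] + (1.0 - p_goal[x]) * lam1 for x in w}
    return lam1, bound


# Hypothetical values: x1 pays 2 and reaches the goal half the time,
# x2 pays 3 and always reaches the goal.
w = {"x1": 2.0, "x2": 3.0}
p_goal = {"x1": 0.5, "x2": 1.0}
lam1, bound = informal_upper_bound(w, p_goal)
```

Here the adversary's best teleport target is `x1` (ratio 4 versus 3), so every state's bound is at most that ratio, and a state that always reaches the goal is bounded by its own cost alone.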

The next theorem generalizes this idea by showing how it can also be used to find a
monotone upper bound, if $w$ and $p_{\mathrm{goal}}$ each satisfy certain (monotonicity-like) conditions.

Theorem 2.4.3. Suppose $p_{\mathrm{goal}}$ and $w$ satisfy the conditions given above for some policy $\pi$.
Further, suppose for all $x$, there exists an action $a$ such that either

(I) $p_{\mathrm{goal}}(x) < \sum_{y \in S} P^a_{xy}\, p_{\mathrm{goal}}(y)$

or

(II) $w(x) \ge c(x, a) + \sum_{y \in S} P^a_{xy}\, w(y)$ and $p_{\mathrm{goal}}(x) = \sum_{y \in S} P^a_{xy}\, p_{\mathrm{goal}}(y)$.

Define $\lambda(x, a)$ by

$$\lambda(x, a) = \frac{c(x, a) + \sum_{y \in S} P^a_{xy}\, w(y) - w(x)}{\sum_{y \in S} P^a_{xy}\, p_{\mathrm{goal}}(y) - p_{\mathrm{goal}}(x)}$$

when case (I) applies, and let $\lambda(x, a) = 0$ when case (II) applies, and $\lambda(x, a) = \infty$
otherwise. Then, if we choose $\lambda = \max_{x \in S} \min_{a \in A} \lambda(x, a)$, the value function $v_u(x) =
w(x) + (1 - p_{\mathrm{goal}}(x))\lambda$ is a finite monotonic upper bound on $v^*$.

Proof. It is sufficient to show that all Bellman errors for $v_u$ are nonnegative, that is,

$$(\forall x) \quad v_u(x) - \min_{a \in A} \Big[ c(x, a) + \sum_{y \in S} P^a_{xy}\, v_u(y) \Big] \ge 0.$$

Plugging in the definition of $v_u$ from above gives

$$(\forall x) \quad w(x) + (1 - p_{\mathrm{goal}}(x))\lambda \ \ge\ \min_{a \in A} \Big[ c(x, a) + \sum_{y \in S} P^a_{xy} \big( w(y) + (1 - p_{\mathrm{goal}}(y))\lambda \big) \Big].$$

We need to show that this inequality holds for all $x$ given $\lambda$ as defined by the theorem.
It is sufficient to show the inequality holds for at least one action, so fix some
$a' \in \operatorname{argmin}_a \lambda(x, a)$. By assumption $\lambda(x, a')$ is finite and hence either condition (I)
or (II) applies. For case (I), we can solve the above inequality for $\lambda$ (using
$\sum_{y} P^{a'}_{xy} = 1$), and arrive at $\lambda \ge \lambda(x, a')$. This condition is satisfied given our
choice of $a'$ and the definition of $\lambda$ from the theorem. If case (II) holds, any $\lambda \ge 0$ will
satisfy the above inequality. It follows that $v_u$ is monotone. □

Now we will show how to construct the necessary $w$, $p_{\mathrm{goal}}$, and corresponding $\pi$. The
idea is simple: suppose state $x_1$ has an action $a_1$ such that $P^{a_1}_{x_1 g} > 0$. Then, we can set
$w(x_1) = c(x_1, a_1)$ and $p_{\mathrm{goal}}(x_1) = P^{a_1}_{x_1 g}$. Now, consider some state $x_2$ that has an action
$a_2$ such that $p = (P^{a_2}_{x_2 g} + P^{a_2}_{x_2 x_1}\, p_{\mathrm{goal}}(x_1)) > 0$. Then, we can set $p_{\mathrm{goal}}(x_2)$ equal to $p$, and
$w(x_2) = c(x_2, a_2) + P^{a_2}_{x_2 g} \cdot 0 + P^{a_2}_{x_2 x_1}\, w(x_1)$. We can now select $x_3$ to be any state with an
action that has positive probability of reaching $g$, $x_1$, or $x_2$, and we will be able to assign
it a positive $p_{\mathrm{goal}}$. The policy $\pi$ corresponding to $p_{\mathrm{goal}}$ and $w$ is given by $\pi(x_i) = a_i$, and
the stopping condition ends a trajectory whenever a transition from $x_i$ to $x_j$ occurs with
$j \ge i$. The $p_{\mathrm{goal}}$ and $w$ values we compute are exact values, not bounds, for this policy and
stopping condition.

To complete the algorithm, it remains to give a method for determining what state to
select next when there are multiple possible states. We propose the greedy maximization
of $p_{\mathrm{goal}}(x_k)$: having fixed $x_1, \ldots, x_{k-1}$, select $(x_k, a_k)$ to maximize
$\sum_{i<k} P^{a_k}_{x_k x_i}\, p_{\mathrm{goal}}(x_i)$. If there are multiple states that achieve the same $p_{\mathrm{goal}}(x_k)$, we
choose the one that minimizes $\sum_{i<k} P^{a_k}_{x_k x_i}\, w(x_i)$. Figure 2.14 gives the pseudocode for
calculating $p_{\mathrm{goal}}$ and $w$; the queue is a min priority queue (with priorities in $\mathbb{R}^2$, which are
compared according to lexical order), and $p_{\mathrm{goal}}(x, a)$ and $w(x, a)$ are analogous to the $Q$
values for $v$. After applying the sweep procedure, one can apply Theorem 2.4.3 to construct $v_u$.

Initialization:
    $\forall (x, a)$: $p_{\mathrm{goal}}(x, a) \leftarrow 0$;  $\forall a$: $p_{\mathrm{goal}}(g, a) \leftarrow 1$
    $\forall (x, a)$: $w(x, a) \leftarrow c(x, a)$
    $p_{\mathrm{goal}}(x)$, $w(x)$ initialized arbitrarily
    $\forall x$: $\pi(x) \leftarrow$ undefined;  $\pi(g) \leftarrow$ (arbitrary action)
    $\forall x$: pri$(x) \leftarrow \infty$, closed$(x) \leftarrow$ false

Sweep:
    queue.enqueue($g$, 0)
    while not queue.empty() do
        $x \leftarrow$ queue.pop()
        closed$(x) \leftarrow$ true
        $w(x) \leftarrow w(x, \pi(x))$
        $p_{\mathrm{goal}}(x) \leftarrow p_{\mathrm{goal}}(x, \pi(x))$
        for all $(y, a)$ s.t. ($P^a_{yx} > 0$) and (not closed$(y)$) do
            $w(y, a) \leftarrow w(y, a) + P^a_{yx}\, w(x)$
            $p_{\mathrm{goal}}(y, a) \leftarrow p_{\mathrm{goal}}(y, a) + P^a_{yx}\, p_{\mathrm{goal}}(x)$
            pri $\leftarrow (1 - p_{\mathrm{goal}}(y, a),\ w(y, a))$
            if pri < pri$(y)$ then
                pri$(y) \leftarrow$ pri
                $\pi(y) \leftarrow a$
                queue.decreaseKey($y$, pri$(y)$)
            end if
        end for
    end while

Figure 2.14: The DS-MPI procedure.

In fact, condition (I) or (II) will always hold for action $a_k$, and so it is sufficient to
set $\lambda = \max_{x_i \in S} \lambda(x_i, a_i)$. To see this, consider the $(x_k, a_k)$ selected by DS-MPI, after
$x_1, \ldots, x_{k-1}$ have already been popped (i.e., closed$(x_i)$ = true, $i < k$). Then, $p_{\mathrm{goal}}(x_k) =
\sum_{i<k} P^{a_k}_{x_k x_i}\, p_{\mathrm{goal}}(x_i)$ (counting the goal among the closed states). At completion, all
states $x$ have $p_{\mathrm{goal}}(x) > 0$, and so the only way $p_{\mathrm{goal}}(x_k)$ can equal
$\sum_{y \in S} P^{a_k}_{x_k y}\, p_{\mathrm{goal}}(y)$ is if all outcomes $y \in$ succ$(x_k, a_k)$ were closed
when $p_{\mathrm{goal}}(x_k)$ was set. This implies that $\sum_{i<k} P^{a_k}_{x_k x_i}\, w(x_i) = \sum_{y \in S} P^{a_k}_{x_k y}\, w(y)$, and so
$w(x_k) = c(x_k, a_k) + \sum_{y \in S} P^{a_k}_{x_k y}\, w(y)$ and condition (II) holds. Otherwise, condition (I) must
hold for $(x_k, a_k)$. Additional backups of $w$ and $p_{\mathrm{goal}}$ will preserve these properties, so if
extra computation time is available or the $p_{\mathrm{goal}}$ values calculated initially are too small,
additional sweeps will tighten the upper bound.
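To make the sweep and the choice of $\lambda$ concrete, here is a minimal Python sketch of the procedure followed by the construction of Theorem 2.4.3. It is an illustration under stated assumptions, not the implementation used in this work: the dictionary layout (`P[x][a]` mapping successors to probabilities, `c[x][a]` giving costs), the function names, the `eps` threshold for distinguishing case (I) from case (II), and the use of lazy deletion on a binary heap in place of `decreaseKey` are all choices made here. It also assumes every state can reach the goal.

```python
import heapq
import itertools
import math


def ds_mpi_sweep(P, c, goal):
    """One Dijkstra-style sweep; returns (w, p_goal, pi)."""
    pred = {}                                   # pred[x] = [(y, a, P^a_{yx})]
    for y in P:
        for a, outcomes in P[y].items():
            for x, p in outcomes.items():
                if p > 0.0:
                    pred.setdefault(x, []).append((y, a, p))
    w_q = {(y, a): c[y][a] for y in P for a in P[y]}    # w(y, a)
    p_q = {(y, a): 0.0 for y in P for a in P[y]}        # p_goal(y, a)
    w, p_goal, pi = {goal: 0.0}, {goal: 1.0}, {}
    pri = {y: (math.inf, math.inf) for y in P}
    closed, tie = set(), itertools.count()
    heap = [((0.0, 0.0), next(tie), goal)]
    while heap:
        _, _, x = heapq.heappop(heap)
        if x in closed:
            continue                            # stale entry, already closed
        closed.add(x)
        if x != goal:
            w[x], p_goal[x] = w_q[(x, pi[x])], p_q[(x, pi[x])]
        for y, a, p in pred.get(x, []):
            if y in closed:
                continue
            w_q[(y, a)] += p * w[x]
            p_q[(y, a)] += p * p_goal[x]
            cand = (1.0 - p_q[(y, a)], w_q[(y, a)])     # lexical priority
            if cand < pri[y]:
                pri[y], pi[y] = cand, a
                heapq.heappush(heap, (cand, next(tie), y))
    return w, p_goal, pi


def monotone_upper_bound(P, c, goal, eps=1e-12):
    """Build v_u(x) = w(x) + (1 - p_goal(x)) * lam from the swept values."""
    w, p_goal, pi = ds_mpi_sweep(P, c, goal)
    lam = 0.0
    for x, a in pi.items():
        exp_p = sum(p * p_goal[y] for y, p in P[x][a].items())
        exp_w = c[x][a] + sum(p * w[y] for y, p in P[x][a].items())
        if exp_p - p_goal[x] > eps:             # case (I); otherwise case (II)
            lam = max(lam, (exp_w - w[x]) / (exp_p - p_goal[x]))
    return {x: w[x] + (1.0 - p_goal[x]) * lam for x in w}, lam


# Hypothetical two-state chain s -> x -> g with noisy moves.
P = {"s": {"go": {"x": 0.8, "s": 0.2}},
     "x": {"go": {"g": 0.5, "s": 0.5}}}
c = {"s": {"go": 1.0}, "x": {"go": 1.0}}
v_u, lam = monotone_upper_bound(P, c, "g")
```

On this small chain there is only one policy, and the resulting bound coincides with its value function; on larger problems the bound is generally looser, and as noted above additional backups of $w$ and $p_{\mathrm{goal}}$ can tighten it.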

If the dynamics are deterministic, then we can always pick $(x_k, a_k)$ so $p_{\mathrm{goal}}(x_k) = 1$,
and so our scheduling corresponds to that of Dijkstra's algorithm. This sweep is similar
to the policy improvement sweeps done by the Prioritized Policy Iteration (PPI) algorithm
described in Section 2.3. The primary differences are that the PPI version assumes it is
already initialized to some upper bound and performs full $Q$ updates, while this version
performs incremental updates.

The running time of DS-MPI is $O(|S| \log |S|)$ (assuming a constant number of actions
and outcomes) if a standard binary heap is used to implement the queue. However, an
unscheduled version of the algorithm will still produce a finite (though possibly much looser)
upper bound, so this technique can be run in $O(|S|)$ time. If no additional information is
available, then this performance is the best possible for arbitrary MDPs: in general it is
impossible to produce an upper bound on any state without doing $O(|S|)$ work, since we
must consider the cost at each reachable state.

2.4.3  Bounded  RTDP 

The pseudocode for Bounded RTDP is given in Figure 2.15. BRTDP has four primary
differences from RTDP.

•  It maintains upper and lower bounds $v_u$ and $v_\ell$ of $v^*$, rather than just a lower bound.
When a policy is requested in an anytime manner (i.e., before convergence), the
policy greedy($v_u$) is returned; $v_\ell$ helps guide exploration in simulation.

•  When trajectories are sampled in simulation, the outcome distribution is biased to
prefer transitions to states with a large gap $(v_u(x) - v_\ell(x))$.

•  BRTDP maintains a list of the states on the current trajectory, and when the trajectory
terminates backups are done in reverse order along the stored trajectory.

•  Trajectories terminate when they reach a state that has a “well-known” value, rather
than when they reach the goal.

We assume BRTDP is initialized so that $v_u$ is an upper bound, and $v_\ell$ is a lower bound. We
defer a justification of this assumption to the next section.

Main loop:
    while $(v_u(s) - v_\ell(s)) > \alpha$ do
        runSampleTrial()
    end while

runSampleTrial:
    $x \leftarrow s$
    traj $\leftarrow$ (empty stack)
    while true do
        traj.push($x$)
        $v_u(x) \leftarrow \min_a Q_{v_u}(x, a)$
        $a \leftarrow \operatorname{argmin}_a Q_{v_\ell}(x, a)$
        $v_\ell(x) \leftarrow Q_{v_\ell}(x, a)$
        $\forall y$, $b(y) \leftarrow P^a_{xy}(v_u(y) - v_\ell(y))$
        $B \leftarrow \sum_y b(y)$
        if ($B < (v_u(s) - v_\ell(s))/\tau$) then break
        $x \leftarrow$ sample from distribution $b(y)/B$
    end while
    while not traj.empty() do
        $x \leftarrow$ traj.pop()
        $v_u(x) \leftarrow \min_a Q_{v_u}(x, a)$
        $v_\ell(x) \leftarrow \min_a Q_{v_\ell}(x, a)$
    end while

Figure 2.15: The bounded RTDP algorithm.


Like RTDP, BRTDP performs backups along sampled trajectories that begin from $s$.
From an arbitrary state $x$ on the trajectory a greedy action $a$ is selected with respect to $v_\ell$.
Let $b(y) = P^a_{xy}(v_u(y) - v_\ell(y))$, and let $B = \sum_{y \in S} b(y)$. Then, BRTDP samples the next
state on the trajectory according to the distribution that assigns prob$(y) = b(y)/B$.

The value of the goal state is known to be zero, and so we assume $v_u(g) = v_\ell(g) = 0$
initially (and hence always). This implies that $b(g) = 0$, and so our trajectories will never
actually reach the goal. It is natural to end trajectories when a “known” state is reached.
For BRTDP, “known” corresponds to states with small gap. However, the smaller the gap
the less likely we are to reach the state, so we instead look at the expected gap under
the greedy action with respect to the unbiased transition probabilities. The normalizing
constant $B$ has exactly this interpretation, so we terminate the trajectory when $B$ is
small. We experimented with both constant thresholds and dynamic ones, and the different
choices have relatively minor impacts on performance. We found the adaptive criterion
$B < (v_u(s) - v_\ell(s))/\tau$, where $\tau > 1$ is a constant (say from 10 to 100), to be as good as
anything. Figure 2.15 gives the complete pseudocode for the algorithm.

A convergence proof for BRTDP must be very different from the standard one for
RTDP. Proving the convergence of RTDP typically relies on the claim that all states reachable
under the greedy policy are backed up infinitely often in the limit [Bertsekas and
Tsitsiklis, 1996]. In the face of dense noise, this implies convergence will require visiting
the full state space. We take convergence to mean $v_u(s) - v_\ell(s) \le \alpha$ for some error
tolerance $\alpha$, and BRTDP can achieve this (given a good initialization) without visiting the
whole state space even with dense noise. A detailed proof can be formed by establishing
that: (1) $v_u$ and $v_\ell$ remain upper and lower bounds on $v^*$, (2) trajectories have finite
expected lengths, and (3) every trajectory has a positive probability of increasing $v_\ell$ or
decreasing $v_u$.
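The trial procedure of Figure 2.15 translates fairly directly into executable form. The following Python sketch is illustrative only and rests on several assumptions that are not part of the algorithm as stated: the dictionary-based MDP layout, the names `brtdp` and `run_sample_trial`, the `max_depth` and `max_trials` safety caps, the restriction of sampling to successors with strictly positive gap, and the small acyclic test problem.

```python
import random


def q_value(P, c, v, x, a):
    """Q_v(x, a) = c(x, a) + sum_y P^a_{xy} v(y)."""
    return c[x][a] + sum(p * v[y] for y, p in P[x][a].items())


def backup(P, c, v, x):
    """Bellman backup value min_a Q_v(x, a)."""
    return min(q_value(P, c, v, x, a) for a in P[x])


def run_sample_trial(P, c, s, v_u, v_l, tau, rng, max_depth):
    x, traj = s, []
    while len(traj) < max_depth:        # max_depth: safety cap (assumption)
        traj.append(x)
        v_u[x] = backup(P, c, v_u, x)
        a = min(P[x], key=lambda act: q_value(P, c, v_l, x, act))
        v_l[x] = q_value(P, c, v_l, x, a)
        # b(y) = P^a_{xy} (v_u(y) - v_l(y)); zero-gap successors (such as the
        # goal, where v_u = v_l = 0 is assumed) are never sampled.
        b = {y: p * (v_u[y] - v_l[y]) for y, p in P[x][a].items()}
        b = {y: gap for y, gap in b.items() if gap > 0.0}
        B = sum(b.values())
        if B <= 0.0 or B < (v_u[s] - v_l[s]) / tau:
            break
        ys = list(b)
        x = rng.choices(ys, weights=[b[y] for y in ys])[0]
    while traj:                         # backups in reverse order
        x = traj.pop()
        v_u[x] = backup(P, c, v_u, x)
        v_l[x] = backup(P, c, v_l, x)


def brtdp(P, c, s, v_u, v_l, alpha=1e-6, tau=10.0, seed=0,
          max_trials=10000, max_depth=1000):
    """Run trials until v_u(s) - v_l(s) <= alpha; returns the trial count."""
    rng = random.Random(seed)
    trials = 0
    while v_u[s] - v_l[s] > alpha and trials < max_trials:
        run_sample_trial(P, c, s, v_u, v_l, tau, rng, max_depth)
        trials += 1
    return trials


# Hypothetical acyclic problem with uninformed bounds (0 below, 100 above).
P = {"s": {"a": {"x": 0.9, "y": 0.1}, "b": {"y": 1.0}},
     "x": {"go": {"g": 1.0}},
     "y": {"go": {"g": 1.0}}}
c = {"s": {"a": 1.0, "b": 1.0}, "x": {"go": 1.0}, "y": {"go": 5.0}}
v_u = {"s": 100.0, "x": 100.0, "y": 100.0, "g": 0.0}
v_l = {"s": 0.0, "x": 0.0, "y": 0.0, "g": 0.0}
trials = brtdp(P, c, "s", v_u, v_l)
```

The depth cap is included because, on problems with self-loops, the adaptive threshold alone may keep a trial running for a long time; it is a guard for this sketch rather than part of the method described above.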


2.4.4  Initialization  Assumptions  and  Performance  Guarantees 

We assume that at the beginning of planning, the algorithm is given $\mathcal{M}$, including $s$. As
mentioned in Section 2.4.2, if this is the only information available, then on arbitrary
problems it may be necessary to consider the whole state space to prove any performance
guarantee.

LRTDP, HDP, and LAO* can converge on some problems without visiting the whole
state space. This is possible if there exists an $E \subset S$ such that some approximately optimal
policy $\pi$ has $f_\pi(y) > 0$ only for $y \in E$, and further, a tight lower bound on the value of $s$
can be proved by only considering states inside $E$ and possibly a lower bound provided at
initialization. While some realistic problems have this property, many do not, including those
with dense noise. The question, then, is what is the minimal amount of additional information
that might allow convergence guarantees while only visiting a small fraction of $S$ on arbitrary
problems. We propose that the appropriate assumption is that an achievable upper bound
$(v^0, \pi^0)$ is known; here $v^0$ is some upper bound (it need not be monotone) on $v_{\pi^0}$ (and
hence $v^*$), where $\pi^0$ is known. Such a pair is almost always available trivially, for example,
by letting $v^0(x) \leftarrow Z$ where $Z$ is some worst-case cost of system failure, and letting $\pi^0(x)$
be the sit-and-wait-for-help action, or something similar. Even such trivial information
may be enough to allow convergence while visiting a small fraction of the state space.

Footnote (on the worst-case cost $Z$): This raises the issue of risk-sensitivity in planning.
A one in a million chance of a million dollar failure

Problem   $|S|$    $v_{\pi_d}(s)$   $v_u(s)$   $v_{\pi'}(s)$   $v^*(s)$   $v_d(s)$
A         21559    29               63         32              23         19
B         21559    1.3e10           26.9       20.1            19.9       19.0
C         6834     15283            1642       485             176        52
D         6834     7662             182.1      117.1           116.7      63.0

Table 2.2: Test problem sizes and start-state values.

It is easiest to see how to use $(v^0, \pi^0)$ via a transformation. Consider $\mathcal{M}' = (S, A \cup
\{\phi\}, P', c', s, g)$, where $\phi$ is a new action that corresponds to switching to $\pi^0$ and following
it indefinitely. This action has $P'^{\phi}_{xg} = 1.0$ and costs $c'(x, \phi) = v^0(x)$; for all other actions,
$P' = P$ and $c' = c$. We know $v^0 \ge v^*$, and so adding the action $\phi$ does not change
the optimal values, so solving $\mathcal{M}'$ is equivalent to solving $\mathcal{M}$. Suppose we run BRTDP on
$\mathcal{M}'$, but extract the current upper bound $v_u^t$ before convergence; then, $v_u^t$ need not be
monotone for $\mathcal{M}$, though it will be for $\mathcal{M}'$. We show how to construct a policy for $\mathcal{M}$ using
only $v_u^t$ that achieves the values $v_u^t$. At a state where $v_u^t(x) \ge \min_{a \in A} Q_{v_u^t}(x, a)$, we play
the greedy action, and the performance guarantee follows from the standard monotonicity
argument. Suppose, however, we reach a state $x$ where $v_u^t(x) < \min_{a \in A} Q_{v_u^t}(x, a)$.
Then, it is not immediately clear how to achieve this value. However, we show that
in this case $v_u^t(x) = v^0(x)$, and so we can switch to $\pi^0$ to achieve the value. Suppose
$v_u^{t'}$ was the value function just before BRTDP backed up $x$ most recently. Then,
BRTDP assigned $v_u^t(x) \leftarrow \min_{a \in A \cup \{\phi\}} Q_{v_u^{t'}}(x, a)$. Since $v_u$ is monotone (for $\mathcal{M}'$, on
which BRTDP is running), $Q_{v_u^{t'}}(x, a) \ge Q_{v_u^t}(x, a)$, and so the only way we could have
$v_u^t(x) < \min_{a \in A} Q_{v_u^t}(x, a)$ is if the auxiliary action $\phi$ achieved the minimum, implying
$v_u^t(x) = v^0(x)$.

Thus, we conclude that via this transformation it is reasonable to assume BRTDP is
initialized with a monotone upper bound, implying that at any point in time BRTDP can
return a stationary policy with provable performance guarantees. This policy will be greedy
in $\mathcal{M}'$, but may be non-stationary on $\mathcal{M}$ as it may fall back on $\pi^0$. This potential
non-stationary behavior is critical to providing a robust suboptimal policy.
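The action-selection rule implied by this argument can be sketched as follows. This is a minimal illustration, with assumed names and data layout: `anytime_action`, the dictionary-based MDP, the `tol` parameter, and the two-state chain with a constant $v^0 = 10$ are all hypothetical.

```python
def q_value(P, c, v, x, a):
    """Q_v(x, a) = c(x, a) + sum_y P^a_{xy} v(y), over the original actions A."""
    return c[x][a] + sum(p * v[y] for y, p in P[x][a].items())


def anytime_action(P, c, v_u, x, fallback_action, tol=1e-12):
    """Choose what to execute at x given an interrupted upper bound v_u.

    Returns (action, switched). switched=True means the greedy value is not
    achievable at x, so the agent should follow pi^0 from here onward.
    """
    best = min(P[x], key=lambda a: q_value(P, c, v_u, x, a))
    if v_u[x] >= q_value(P, c, v_u, x, best) - tol:
        return best, False
    return fallback_action(x), True


# Hypothetical chain x -> y -> g with unit costs, and a trivial fallback
# policy pi^0 whose value is bounded above by v^0 = 10 everywhere.
P = {"x": {"go": {"y": 1.0}}, "y": {"go": {"g": 1.0}}}
c = {"x": {"go": 1.0}, "y": {"go": 1.0}}
pi0 = {"x": "go", "y": "go"}

before = {"x": 10.0, "y": 10.0, "g": 0.0}   # nothing backed up yet
after = {"x": 10.0, "y": 1.0, "g": 0.0}     # y has been backed up

first = anytime_action(P, c, before, "x", pi0.get)
second = anytime_action(P, c, after, "x", pi0.get)
```

Before any backups, the greedy value at `x` (11) exceeds the stored bound (10), so the rule switches to the fallback; once `y` has been backed up the greedy action achieves the bound and is played directly.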


2.4.5  Experimental  Results 

We test BRTDP on two discrete domains. The first is the 4-dimensional racetrack domain,
described in [Barto et al., 1995, Bonet and Geffner, 2003b,a, Hansen and Zilberstein,
2001]. Problems (A) and (B) are from this domain, and use the large-b racetrack
map [Bonet and Geffner, 2003a]. Problem (A) fixes a 0.2 probability of getting the zero
acceleration rather than the chosen control, similar to test problems from the above references.
Problem (B) uses the same map, but uses a dense noise model where with a 0.01
probability a random acceleration occurs. Problems (C) and (D) are from a 2D gridworld
problem, where actions correspond to selecting target cells within a Euclidean distance of
two (giving 13 actions). Both instances use the same map. In (C), the agent accidentally
targets a state up to a distance 2 from the desired target state, with probability 0.2. In (D),
however, a random target state (within distance 2) is selected with probability 0.01. Note
that problems (A) and (C) have fairly sparse noise, while (B) and (D) have dense noise.

Footnote (continued, on risk-sensitivity): might be acceptable in an expected sense, but it
might be preferable to pay \$2 for insurance.

Table 2.2 summarizes the sizes of $S$ for the test problems. The other columns provide
information to enable an evaluation of DS-MPI. The $v_{\pi_d}(s)$ column gives the value
of the start state under a policy derived from solving the deterministic relaxation (see
Section 2.4.1). The next column, $v_u(s)$, gives the value computed via DS-MPI. We let
$\pi' = $ greedy($v_u$), and give $v_{\pi'}(s)$. This data shows that DS-MPI can produce high-quality
upper bounds that have high-quality greedy policies, despite running in $O(|S| \log |S|)$ time
rather than the $O(|S|^3)$ needed to compute $v_{\pi_d}$. The final column gives $v_d(s)$, the value
of the start state under the value function of the deterministic relaxation.


Anytime  Performance 

We compare the anytime performance of BRTDP to RTDP on the test domains listed in
Table 2.2, considering both informed initialization and uninformed initialization for both
algorithms. Informed initialization means RTDP has its value function initially set to $v_d$,
and BRTDP has $v_\ell$ set to $v_d$ and $v_u$ set by running DS-MPI. For uninformed initialization,
RTDP has its value function set uniformly to zero, and BRTDP has $v_\ell$ set to zero and $v_u$ set to $10^6$.

Figure 2.16 gives anytime performance curves for the algorithms on each of the test
problems. There are many possible models of online interaction between a planner that
produces policies and an actor that executes them. In general, this interaction can be
quite complex. We adopt the precursor deliberation [Dean et al., 1995] or anytime model,
wherein we interrupt each algorithm at fixed intervals to consider the quality of the policy
available at that time. Rather than simply evaluating the current greedy policy, we assume
the executive agent has some limited computational power and can itself run RTDP on a
given initial value function received from the planner. (This assumption results in a fairer
comparison for RTDP, since that algorithm's greedy policy may be improper.) To evaluate
a value function $v$, we place an agent at the start state, initialize its value function to $v$, run
RTDP until we reach the goal, and record the cost of the resulting trajectory. The curves in
Figure 2.16 are the result of 100 separate runs of each algorithm, with each value function
evaluated using 200 repetitions of the above testing procedure.

Figure 2.16: Anytime performance of informed and uninformed RTDP and BRTDP: the
first row is for the informed initialization, and the second for uninformed. The X-axis
gives number of backups (x 10^), and the Y-axis indicates the current value of the policy;
Y-axis labels are negative costs, so higher numbers are better. Note the differences in
scales.

BRTDP performs 4 backups for each state on the trajectories it simulates: one each
on $v_u$ and $v_\ell$ during forward simulation, and one each while traversing the trajectory in
reverse order. RTDP performs only one backup per sampled state. This gives BRTDP
lower overhead and better cache performance per backup, and on the test problems we
observed it computed 1.5 to 3 times more backups per unit of runtime than RTDP. Thus,
if Figure 2.16 were re-plotted with time as the X-axis, the performance of BRTDP would
appear even stronger. So, in interpreting the results from this section one should realize
we have handicapped BRTDP in two ways: we compare it to RTDP in terms of number
of updates rather than CPU time, and we evaluate RTDP trajectories rather than stationary
policies, even though stationary policies taken from BRTDP have provable guarantees.

Several conclusions can be drawn from the results in Figure 2.16. First, appropriate
initialization provides significant help to both RTDP and BRTDP. Second, under both types of
initialization, BRTDP often provides much higher-reward policies than RTDP for a given
number of backups (especially with a small number of backups, and especially with
informative initialization), and we never observed its policies to be much worse than those
of RTDP. In particular, observe that on problems (C) and (D) BRTDP is nearly optimal from
the very beginning. This, combined with the fact that BRTDP provides performance bounds even
for stationary policies, makes BRTDP a very attractive option for anytime applications.


Figure 2.17: CPU time required for convergence with informed (left) and uninformed
(right) initialization of the algorithms.


Off-line  Convergence 

We compare off-line convergence times for BRTDP to those of LRTDP and HDP.^^ Again,
we consider both informed and uninformed initialization. Informed LRTDP and HDP have
their value functions initialized to $v_d$, while uninformed initialization sets them to zero.
Time spent computing informed initialization values is not charged to the algorithms. This
time will be somewhat longer for BRTDP, as it also uses an upper-bound heuristic; however,
it is typically dominated by the algorithm runtime.

We evaluate the algorithms by measuring the time it takes to find an $\alpha$-optimal partial
policy. For BRTDP, since we maintain upper and lower bounds, we can simply terminate
when $(v_u(s) - v_\ell(s)) < \alpha$; we used $\alpha = 0.1$ in our experiments. As discussed in
Section 2.4.2 we initialized $v_u$ to a monotone upper bound, so the greedy policy with
respect to the final $v_u$ will be within $\alpha$ of optimal. The other tested algorithms measure
convergence by stopping when the max-norm Bellman error drops below some tolerance
$\epsilon$. Without further information there is no way to translate $\epsilon$ into a bound on policy quality:
we can incur an extra cost of $\epsilon$ at each step of our trajectory, but since our trajectory could
have arbitrarily many steps we could be arbitrarily suboptimal by the end. To provide an
approximately equivalent stopping criterion, we used the following heuristic: pick an
optimal policy $\pi^*$ and let $\ell^*(x)$ be the expected number of steps to reach $g$ from $x$ by following
$\pi^*$. Then take $\epsilon = \alpha/\ell^*(s)$. This heuristic yielded $\epsilon$ = 0.004, 0.005, 0.001, and 0.002 for
problems (A) through (D).

^^Improved LAO* is very similar to HDP without labeling solved states, and [Bonet and Geffner, 2003a]
shows HDP has generally better performance, so LAO* was not considered in our experiments.
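
The two stopping rules described above can be sketched in a few lines of Python. This is only an illustration: the function names, the dictionary representation of the bounds, and the numbers are assumptions made for the example, not part of the experimental code. Note that with $\alpha = 0.1$, a tolerance of 0.004 corresponds to an expected path length of 25 steps under the heuristic.

```python
def brtdp_converged(v_upper, v_lower, start, alpha):
    """BRTDP-style test: the bound gap at the start state is below alpha."""
    return v_upper[start] - v_lower[start] < alpha


def bellman_tolerance(alpha, expected_steps):
    """Heuristic tolerance epsilon = alpha / l*(s) for Bellman-error stopping.

    expected_steps stands for l*(s), the expected number of steps an optimal
    policy needs to reach the goal from the start state.
    """
    if expected_steps <= 0:
        raise ValueError("expected_steps must be positive")
    return alpha / expected_steps


# Hypothetical bounds, chosen only for illustration.
v_upper = {"s": 10.05, "g": 0.0}
v_lower = {"s": 10.00, "g": 0.0}
print(brtdp_converged(v_upper, v_lower, "s", alpha=0.1))
print(bellman_tolerance(alpha=0.1, expected_steps=25))
```

The first test needs no knowledge of path lengths because the bounds themselves certify the gap; the second requires an estimate of $\ell^*(s)$, which is why it is only a heuristic.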

As expected, on (B) and (D), the problems with dense noise, BRTDP significantly
outperformed the other algorithms. On (D), uninformed BRTDP is 3.2 times faster than
uninformed HDP, and informed BRTDP is 6.4 times faster than informed HDP. Uninformed
BRTDP outperforms informed HDP on (D) by a factor of 1.8. More importantly,
on (B) and (D) HDP and LRTDP visit all of $S$ before convergence, while (for example)
on (B), informed BRTDP visits 28% of $S$ and only brings $|v_u(x) - v_\ell(x)| < \alpha$ for 10%
of $S$. If we, for example, add further states to the part of the state space that BRTDP does
not visit, the performance gap can be made arbitrarily large.
Chapter 3

Bilinear-payoff  Convex  Games 


Convex games generalize zero-sum matrix games by allowing arbitrary convex strategy
sets in the place of explicitly enumerated finite strategy sets. This very general framework
can compactly represent large games with sequential decisions. For example, extensive-form
games can be represented compactly in this way, but so can games with other kinds
of structure, including path-planning games with uncertain outcomes and adversary-controlled
costs, and routing problems with adversary-controlled demands. In this chapter,
we define the convex game model, introduce notation, and describe previous theoretical
results on convex games. To demonstrate the utility of the framework, we then discuss
three games that can be modeled as convex games, and also show how we can generalize
stochastic games using convex games.

The representational power of convex games makes algorithms for their solution
particularly important. It was shown by Koller et al. [1994] that polyhedral convex games can
be solved via linear programming (that work focuses on the application to extensive-form
games, but the formulation in fact holds for general convex games). Since that seminal
result, the reduction to linear programming has been the state of the art for solving this class
of problems. For example, sophisticated game-abstraction techniques combined with
linear programming only recently allowed for the exact solution of Rhode Island Hold’em
poker, a simplified version of the standard game of heads-up limit Texas Hold’em. Even
after the application of the equilibria-preserving abstraction, solving the corresponding
linear program exactly took over 7 days of CPU time and 25 GB of memory [Gilpin and
Sandholm, 2005].

In fact, convex games often have significant structure that is not exploited by
general-purpose linear programming algorithms. One way such structure can be exploited is
through fast algorithms for calculating a best-response strategy to a fixed strategy of
the opponent. The classic fictitious play algorithm takes advantage of such oracles, and
demonstrates remarkably good performance on Rhode Island Hold’em. In Chapter 5, we
develop a new algorithm for solving convex games that also uses best-response oracles;
it outperforms fictitious play and dramatically outperforms the direct application of linear
programming techniques.

While convex games are a straightforward generalization of matrix games, the ability
to represent arbitrary convex strategy sets lets us take advantage of structure in many types
of games, often yielding exponentially smaller representations. In the coming sections, we
will consider three examples that illustrate this point:

•  As previously mentioned, extensive-form games (EFGs) can be transformed to
convex games. While there are typically exponentially many (in the size of the game
tree) pure strategies for an EFG, the set of behavioral strategies can be represented
concisely as a convex set of achievable sequence weight vectors.

•  The well-studied problem of computing an optimal oblivious routing can in fact
be expressed as a convex game. In this game, one player picks a routing in a network
and the other picks traffic demands on source-sink pairs, subject to some constraints.
There are exponentially many deterministic routings (pure strategies in the matrix
game), but again there is a concise^ representation of the set of strategies as a
polyhedron. The details of expressing this problem as a convex game follow from work
by Azar et al. [2003], though they did not connect their work to the convex game
model and in fact a polynomial-sized LP formulation did not appear until [Applegate
and Cohen, 2003]. The observation that optimal oblivious routing is a convex
game is new, and the algorithms presented in Chapter 5 may be of practical interest
for this problem.

•  In cost-paired MDP games, each player selects a stochastic policy in an MDP,
and their choice determines the costs in the opponent’s MDP. The set of strategies
(stochastic policies in the MDPs) for each player has a polynomial-sized
representation as a polyhedron, but there are exponentially many deterministic policies and
so the corresponding matrix game is exponential in both rows and columns.

We also show how convex games can be used to generalize stochastic games. Stochastic
games extend MDPs to multiple players by embedding a matrix game at each state in
an MDP; the next-state distribution and cost to each player depend on the joint action
selected. In Section 3.5 we show how these matrix games can be replaced by convex
games; this allows the embedding of extensive-form games at the nodes of stochastic
games, creating a tractable class of partially-observable stochastic games (POSGs).

^The size of the representation of the constraints is polynomial in the size of the representation of the
problem.

As  these  examples  demonstrate,  fast  algorithms  for  convex  games  have  far-reaching 
applicability.  Before  considering  these  examples  in  greater  depth,  we  first  establish  some 
technical  details  about  convex  games. 


3.1  From  Matrix  Games  to  Convex  Games 

A zero-sum matrix (normal-form) game is played by two players, player row with
strategies $R = \{1, \ldots, m\}$ and player column with strategies $C = \{1, \ldots, n\}$. An $m \times n$ matrix
$M$ specifies the payoffs, so that if row plays strategy $i \in R$ and column plays $j \in C$,
the payment from row to column is the $(i,j)$th entry of $M$, denoted $M_{ij}$. The players
select their strategies simultaneously, without knowledge of the other player’s choice.
We restrict our attention to two-player, zero-sum games as that is the simplest case and
the one most relevant to our solution techniques; however, n-player general-sum matrix
games (and convex games) can be defined analogously.

We use $\Delta(\cdot)$ to denote the probability simplex over a finite set, so for example

\[ \Delta(R) = \Big\{ p \in \mathbb{R}^m \;\Big|\; \sum_{i=1}^{m} p(i) = 1 \text{ and } p(i) \ge 0 \Big\}. \]

A mixed strategy is an element $p \in \Delta(R)$ for the row player or $q \in \Delta(C)$ for the
column player, corresponding to a distribution over the rows or columns, respectively. If the
players select mixed strategies $p$ and $q$, the expected payoff $V(p, q)$ from row to column is

\[ V(p, q) = \sum_{i \in R} \sum_{j \in C} p(i)\, M_{ij}\, q(j) = p^\top M q. \]

A solution to the game is given by a minimax equilibrium $(p^*, q^*)$, a pair of mixed
strategies such that neither player has an incentive to play differently given that the other player
plays their strategy from the pair. The minimax theorem says that if the players are allowed
to select mixed strategies, there is no advantage to playing second. That is,

\[ \min_{x \in \Delta(R)} \max_{y \in \Delta(C)} x^\top M y = \max_{y \in \Delta(C)} \min_{x \in \Delta(R)} x^\top M y. \tag{3.1} \]

Thus, solving either the min max or max min optimization problem from (3.1) finds a
minimax equilibrium for the matrix game. This problem can easily be converted to a
linear program and solved via standard techniques.
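
As a sketch of the conversion to a linear program (this is the standard textbook form, given here as an illustration rather than quoted from the source), the min max problem in (3.1) can be written over the row player's mixed strategy $x$ and an auxiliary scalar $\lambda$ that upper-bounds the payment against every column:

```latex
\begin{align*}
\min_{x,\,\lambda} \quad & \lambda \\
\text{subject to} \quad & \sum_{i=1}^{m} x(i)\, M_{ij} \le \lambda \quad \text{for all } j \in C, \\
& \sum_{i=1}^{m} x(i) = 1, \\
& x(i) \ge 0 \quad \text{for all } i \in R.
\end{align*}
```

The constraints range only over the $n$ pure column strategies because a linear function over $\Delta(C)$ attains its maximum at a corner of the simplex, so the inner maximization never needs a mixed $y$.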


An $\epsilon$-approximate minimax equilibrium for a matrix game is a pair of strategies $(p', q')$
where neither player can gain more than $\epsilon$ value by switching to some other strategy. That
is,

\[ V(p', q') \le \min_{p \in \Delta(R)} V(p, q') + \epsilon \tag{3.2} \]

\[ V(p', q') \ge \max_{q \in \Delta(C)} V(p', q) - \epsilon. \tag{3.3} \]

Note that if $\epsilon = 0$ we have an exact minimax equilibrium.
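
Conditions (3.2) and (3.3) are easy to check numerically. The following Python sketch computes the smallest $\epsilon$ for which a given pair of mixed strategies is an $\epsilon$-approximate equilibrium; the function names and the matching-pennies matrix are illustrative assumptions, and matrices are plain nested lists.

```python
def payoff(p, M, q):
    """Expected payment from row to column, V(p, q) = p^T M q."""
    return sum(p[i] * M[i][j] * q[j]
               for i in range(len(p)) for j in range(len(q)))


def equilibrium_gap(p, M, q):
    """Smallest epsilon for which (p, q) satisfies (3.2) and (3.3).

    A linear function over a simplex is optimized at a corner, so it is
    enough to compare against pure rows and pure columns.
    """
    m, n = len(M), len(M[0])
    v = payoff(p, M, q)
    # Best the row player (the payer) can do against q: minimize the payment.
    best_row = min(sum(M[i][j] * q[j] for j in range(n)) for i in range(m))
    # Best the column player can do against p: maximize the payment.
    best_col = max(sum(p[i] * M[i][j] for i in range(m)) for j in range(n))
    return max(v - best_row, best_col - v)


# Matching pennies, used only as an example.
M = [[1, -1], [-1, 1]]
print(equilibrium_gap([0.5, 0.5], M, [0.5, 0.5]))  # exact equilibrium
print(equilibrium_gap([1.0, 0.0], M, [0.5, 0.5]))  # row is exploitable
```

The same quantity is often called exploitability; a gap of zero recovers the exact equilibrium case noted above.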

Two-player zero-sum bilinear-payoff convex games are a natural generalization of
matrix games; we will simply refer to this class as “convex games” in the sequel. This
formulation was first introduced by Dresher and Karlin [1953], but convex games have
received remarkably little attention in the literature considering the generality and
usefulness of the framework. One of the goals of this chapter is to highlight several interesting
special cases of convex games, and suggest that the class deserves much greater attention
from an algorithmic perspective.

Convex games allow arbitrary convex sets $X$ and $Y$ in place of the probability
simplices $\Delta(R)$ and $\Delta(C)$ for matrix games. A convex game is specified by a tuple $(X, Y, M)$
where $X \subseteq \mathbb{R}^m$ and $Y \subseteq \mathbb{R}^n$ are the strategy sets for the two players, and $M$ is an $m \times n$
payoff matrix. The first player (whom we will call x) selects an action $x \in X$, the second
player (called y) simultaneously chooses $y \in Y$, and the payoff from player x to player y
is given by

\[ V(x, y) = x^\top M y. \]

The concepts of equilibria and $\epsilon$-approximate equilibria naturally generalize to convex
games. Throughout the thesis, we assume all convex action sets ($X$ and $Y$ in this case)
are nonempty; for simplicity, we generally also assume $X$ and $Y$ are bounded, though
this restriction can often be removed or easily enforced.^ This ensures that the games we
consider have finite value (that is, $\inf_x V(x, y)$ and $\sup_y V(x, y)$ are finite).

A polyhedron is a convex set defined by a finite number of linear equality and
inequality constraints. Generally, we will represent these constraints in matrix notation, for
example,

\[ X = \{x \mid Ax = b,\ x \ge 0\} \]

for a suitable matrix $A$ and vector $b$. We say a convex game is polyhedral if $X$ and $Y$ are
polyhedra. Polyhedral convex games can be solved in polynomial time via linear
programming: though the term “convex game” was not used, the efficient solution technique for
zero-sum EFGs presented in [Koller et al., 1994] essentially depends on a transformation
from an EFG to an equivalent convex game. Koller et al. then show how to solve the
resulting convex game efficiently. We review their solution method here, as in the next
chapter we present a non-trivial extension of their technique which solves a generalization
of EFGs.

^For example, if X is the set of stochastic policies for an MDP represented as state-action visitation
frequencies, we can add an additional constraint to prohibit policies that loop indefinitely (making X bounded).
However, often it is sufficient to simply show that these policies are never optimal (e.g., in the case of positive
costs) and so have no impact on the optimization.

  Matrix Games                       Convex Games
  -------------------------------    ------------------------------------
  Pure (single row)                  Pure (an extreme point of X)
  Mixed (distribution over rows)     Implicit Mixed (any point in X)
                                     Explicit Mixed (distribution over X)

Table 3.1: Strategy classes for matrix and convex games.

Consider a matrix game defined by payoff matrix $M$, and the corresponding
polyhedral convex game $(X, Y, M)$ where $X = \Delta(R)$ and $Y = \Delta(C)$. We write Cn(Z) to
denote the extreme points (corners) of an arbitrary polyhedron Z. Note that Cn(Z) is a
finite set; in general, however, its size may be exponential in the size of the representation
of Z. Since X is a probability simplex, though, we have |Cn(X)| = m, and there is
a natural mapping between Cn(X) and R (and similarly between Cn(Y) and C): each
corner of X corresponds to the probability distribution that picks a particular row of M
deterministically. Interior points of X correspond to mixed strategies in the matrix game.

Table 3.1 shows the different classes of strategies that we consider for matrix and
convex games. Based on the analogy to matrix games, we call the strategies in Cn(X)
the pure strategies, while we call strategies from the full set X (including interior points)
implicit mixed strategies. This is natural given that if $X = \Delta(R)$, the interior points of X
correspond to mixed strategies. An explicit mixed strategy is given by a distribution over
some subset of X (possibly given as a probability density on all of X, or perhaps simply a
distribution on the extreme points of X).

It  would  be  equally  reasonable  to  term  an  interior  point  of  X  a  pure  strategy,  as  it  is  a 
single  strategy  drawn  from  the  set  X  of  possible  strategies.  In  this  case,  a  mixed  strategy 
(distribution  over  pure  strategies)  would  be  what  we  call  an  explicit  mixed  strategy.  We 
use  the  more  precise  terms  pure,  implicit  mixed,  and  explicit  mixed  to  avoid  these  possible 
ambiguities. 
The interpretation of these types of strategies depends on the nature of the convex
game. In particular, in some games the interior points of X can be considered primitive
actions (actions that can be implemented in the world directly), but in others the corners
are the only primitive actions, and interior points must be interpreted as probability
distributions over these. We will have more to say about the interpretation of points in X
later in the section; it will turn out these distinctions do not matter from an optimization
viewpoint.

Next,  we  will  show  how  to  solve  polyhedral  convex  games  via  linear  programming,  and 
simultaneously  prove  the  minimax  theorem  for  polyhedral  convex  games.  This  minimax 
result  is  in  fact  achieved  by  implicit  mixed  strategies;  we  will  go  on  to  show  that  neither 
player  has  any  incentive  to  play  explicit  mixed  strategies. 


3.1.1  Solution  via  Convex  Optimization,  and  the  Minimax  Theorem 

The method for solving convex games described in this section is due to Koller et al.
[1994]; Von Stengel [2002] gives a more detailed presentation with examples. We consider
the case where $X$ and $Y$ are polyhedra, $X = \{x \mid Ax = b,\ x \ge 0\}$ and $Y = \{y \mid
Cy = d,\ y \ge 0\}$, but the result can be extended to general convex sets. Suppose player
x announces he will play a fixed strategy $x \in X$. Then, we can find a best response for
player y by solving:

\begin{align*}
\max_{y} \quad & (x^\top M) y \\
\text{subject to} \quad & Cy = d \\
& y \ge 0.
\end{align*}

The dual of this LP is

\begin{align*}
\min_{q} \quad & q^\top d \\
\text{subject to} \quad & q^\top C \ge x^\top M.
\end{align*}

Strong duality holds,^ so the values of the two programs are equal for all $x$. Thus, we
can solve the game where first x picks a strategy $x$ and then y observes this and picks a
response $y$ via the following program:

\[ \min_{x \in X} \max_{y \in Y} x^\top M y
   = \min_{x \in X} \Big[ \max_{y}\ (x^\top M) y \ \text{ subject to } Cy = d,\ y \ge 0 \Big] \]

and substituting the dual for the primal,

\[ = \min_{x \in X} \Big[ \min_{q}\ q^\top d \ \text{ subject to } q^\top C \ge x^\top M \Big]. \]

^This is ensured because X and Y are bounded, nonempty polyhedra, but a direct proof (perhaps using
Slater’s constraint qualifications [see Boyd and Vandenberghe, 2004, Section 5.2.3]) may be necessary in
the case of general convex X and Y.

This then simplifies to the LP

\begin{align*}
\min_{x,\,q} \quad & q^\top d \tag{3.4} \\
\text{subject to} \quad & q^\top C \ge x^\top M \\
& Ax = b \\
& x \ge 0.
\end{align*}

By an analogous argument we can solve the game where y plays first by

\begin{align*}
\max_{y \in Y} \min_{x \in X} x^\top M y = \max_{y,\,p} \quad & p^\top b \tag{3.5} \\
\text{subject to} \quad & A^\top p \le My \\
& Cy = d \\
& y \ge 0.
\end{align*}

It is straightforward to verify that LP (3.5) is in fact the dual of LP (3.4), and strong duality
then gives the minimax theorem for convex games:

\[ \min_{x \in X} \max_{y \in Y} x^\top M y = \max_{y \in Y} \min_{x \in X} x^\top M y. \tag{3.6} \]

Thus, we can solve the convex game by constructing the linear program from either
Equation (3.4) or Equation (3.5) and applying any standard linear programming solver. It is
worth noting that the LP (3.4) can also be expressed as

\begin{align*}
\min_{x,\,\lambda} \quad & \lambda \\
\text{subject to} \quad & x \in X \\
& \lambda \ge (x^\top M) y \quad \text{for all } y \in Y. \tag{3.7}
\end{align*}
For general $Y$ we can solve this problem via the Ellipsoid algorithm by using an
optimization over $Y$ with the fixed linear objective function $x^\top M$ to detect violations of
constraints (3.7) at the current $(x, \lambda)$. This is important because the optimization for fixed
$x^\top M$ may be efficiently solvable using a domain-specific algorithm; we will show in
Chapter 5 that such best-response oracles play a central role in designing fast algorithms
for convex games.
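
The role of the best-response oracle in checking constraints (3.7) can be illustrated with a short Python sketch. The function names and the callback interface are assumptions made for this example; any routine that maximizes a linear objective over $Y$ could be plugged in, and the simplex oracle below is just the simplest such routine.

```python
def violated_constraint(x, lam, M, best_response, tol=1e-9):
    """Check lam >= (x^T M) y for all y in Y at the candidate point (x, lam).

    best_response(c) is assumed to return some y in Y maximizing c . y.
    Returns None if no constraint is violated, otherwise a violating y.
    """
    m, n = len(M), len(M[0])
    c = [sum(x[i] * M[i][j] for i in range(m)) for j in range(n)]  # c = x^T M
    y = best_response(c)
    value = sum(c[j] * y[j] for j in range(n))
    return y if value > lam + tol else None


def simplex_best_response(c):
    """Oracle for the case where Y is a probability simplex: pick the best column."""
    best = max(range(len(c)), key=lambda k: c[k])
    return [1.0 if k == best else 0.0 for k in range(len(c))]


# Matching pennies with Y a simplex, used only as an example.
M = [[1, -1], [-1, 1]]
print(violated_constraint([0.5, 0.5], 0.0, M, simplex_best_response))
print(violated_constraint([1.0, 0.0], 0.5, M, simplex_best_response))
```

A single oracle call either certifies that $(x, \lambda)$ satisfies every constraint in the family or returns one violated constraint, which is exactly what a cutting-plane or Ellipsoid-style method requires.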


Mixed strategies are equivalent to interior points  We have shown that a minimax
equilibrium exists when both players choose from the sets X and Y directly, without
randomization. But perhaps one of the players could do better by playing an explicit mixed
strategy? In fact, the answer is no. It is sufficient to show that even if a player (say, x)
goes first and announces his strategy, he has no reason to announce a distribution over X
(explicit mixed strategy) rather than a single $x \in X$ (implicit mixed strategy). This result
is not an immediate consequence of the minimax theorem (Equation (3.6)), because that
statement assumes both players are limited to playing only implicit mixed strategies.

Suppose x plays first and selects an explicit mixed strategy given by $p$ and $\hat{X}$, where
$\hat{X} = \{x_1, \ldots, x_k\}$ is a finite subset of $X$ and $p_i$ is the probability she selects $x_i \in \hat{X}$ (the
case where x selects a probability density over all points in X is similar). Let $\bar{x} = \sum_i p_i x_i$;
this point is in $X$ by the definition of convexity. Player y is told that player x’s choice is
$(p, \hat{X})$ and then he selects a best response. Thus, the expected payoff from x to y is:

\begin{align*}
E[V] &= \max_{y \in Y} \sum_i \Pr(\text{x plays } x_i)\, V(x_i, y) \\
     &= \max_{y \in Y} \sum_i p_i \left( x_i^\top M y \right) \\
     &= \max_{y \in Y} \Big( \sum_i p_i x_i \Big)^{\top} M y \\
     &= \max_{y \in Y} \bar{x}^\top M y. \tag{3.8}
\end{align*}

Note that (3.8) is exactly the payoff if x had announced $\bar{x} \in X$ instead of the explicit
mixed strategy given by $p$. This implies neither player can get a better payoff by choosing
an explicit mixed strategy. There may be many different weights $p$ that represent the point
$\bar{x}$. Thus, $\bar{x}$ may be viewed as defining an equivalence class of payoff-equivalent explicit
mixed strategies. See Figure 3.1 for an example; this figure is discussed in detail in the
next section.
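
The linearity step behind (3.8) can be checked numerically. In the Python sketch below, the matrix, the support points, and the weights are all made-up values for illustration; the check confirms that, against any fixed $y$, an explicit mixture over points of $X$ earns the same expected payoff as its mean point.

```python
def mean_point(p, points):
    """Implicit mixed strategy x_bar = sum_i p_i x_i for an explicit mixture."""
    dim = len(points[0])
    return [sum(p[i] * points[i][d] for i in range(len(points)))
            for d in range(dim)]


def bilinear(x, M, y):
    """Payoff V(x, y) = x^T M y."""
    return sum(x[i] * M[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))


def expected_payoff(p, points, M, y):
    """Expected payoff of the explicit mixture (p, points) against a fixed y."""
    return sum(p[i] * bilinear(points[i], M, y) for i in range(len(points)))


# Hypothetical data: three points of X mixed with weights p.
M = [[2.0, -1.0], [0.0, 3.0]]
points = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
p = [0.2, 0.3, 0.5]
y = [0.25, 0.75]
x_bar = mean_point(p, points)
print(x_bar, expected_payoff(p, points, M, y), bilinear(x_bar, M, y))
```

Because the two payoffs agree for every fixed $y$, they also agree after maximizing over $y$, which is the content of (3.8).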
Interpretations of the action sets  How should we interpret the convex set of actions
X? We consider two possibilities. First, we may view every element $x \in X$ as a primitive
(playable) action that can be selected and then directly implemented in the world. For
example, $x$ might correspond to an allocation of money among m different investments,
subject to some constraints.

On the other hand, consider the case of a matrix game represented as a convex game.
We can solve the convex game via the linear program (3.4). Then, the optimal feasible
point $x$ will (in general) correspond to an interior point of the set X. Since X is a
probability simplex, |Cn(X)| = dim(X) = m and we can naturally interpret $x$ as a probability
distribution over Cn(X) and hence R. To “play” the matrix game, we can sample from
this distribution and play that corner (row).

There are other cases where the set X is a polyhedron and only the corners Cn(X)
are primitive actions. For example, the set of stochastic policies for an MDP can be
represented as a convex set, and the corners Cn(X) correspond to the deterministic policies;
we will discuss this example in detail in Section 3.4.3. In this case, |Cn(X)| can be
exponentially larger^ than dim(X), and so even explicitly representing an arbitrary distribution
over Cn(X) may be infeasible. But, if the definition of the game requires we select an
extreme point as an action, how do we interpret the interior point $x$? Fortunately, we have
the following representation theorem (this version is from Bazaraa et al. [1990]):

Theorem 3.1.1. Any bounded polyhedron $X \subseteq \mathbb{R}^m$ has a finite set of extreme points
(corners), say Cn(X) = $\{x_1, \ldots, x_k\}$. Any $x \in \mathbb{R}^m$ is a member of $X$ if and only if
there exists $p \in \mathbb{R}^k$ with $p_i \ge 0$ and $\sum_{i=1}^{k} p_i = 1$ (that is, $p \in \Delta(\mathrm{Cn}(X))$) such that
$x = \sum_{i=1}^{k} p_i x_i$. Further, there always exists a representation such that no more than $m + 1$
of the $p_i$ coefficients are non-zero and such a representation can be found in polynomial
time.

Thus, to play a “corners only” convex game we can solve the LP formulation to find an
interior point solution, generate a small-support distribution over the corners by the method
of Theorem 3.1.1, and then sample from this to determine the actual primitive action
to take; each of these steps takes only polynomial time. We are effectively solving an
exponentially-sized matrix game (the game with rows Cn(X) and columns Cn(Y)), albeit
one with a very special structure: the exponential set of actions has a low-dimensional
representation.

As an example, consider Figure 3.1. The convex strategy set X has corners Cn(X) =
$\{c_1, c_2, c_3, c_4, c_5\}$. The labeled interior point $x$ falls inside the convex hull of $\{c_2, c_4, c_5\}$
(left figure), which implies there exists a mixture $(p_2, p_4, p_5) \in \Delta(\{c_2, c_4, c_5\})$ such that
$x = p_2 c_2 + p_4 c_4 + p_5 c_5$. Similarly, the middle figure implies there exists $(q_1, q_3, q_4) \in
\Delta(\{c_1, c_3, c_4\})$ such that $x = q_1 c_1 + q_3 c_3 + q_4 c_4$, and the right figure shows that such a
distribution $r$ can be found on $\{c_1, c_3, c_5\}$ as well. Thus $p$, $q$, and $r$ are different explicit mixed
strategies corresponding to the single implicit mixed strategy $x$. Equation (3.8) shows that
these explicit mixed strategies all get payoff equal to the payoff $x$ achieves against any
opponent strategy. In fact, there is an infinite equivalence class of such distributions and
it is a convex set: if we view $p$ and $q$ as distributions over the full Cn(X), then for any
$\theta \in [0, 1]$ the distribution $\theta p + (1 - \theta) q$ will also be an explicit mixed strategy equivalent
to $x$. Also, observe that $m = 2$ (that is, $X \subseteq \mathbb{R}^2$) and it was possible to represent $x \in X$
with a distribution supported by only $m + 1 = 3$ points. The representation theorem says
that this holds for all $m$.

^It is quite common to solve linear programs where enumerating the full set of extreme points would be
impossible.

Figure 3.1: Three different explicit representations of an implicit mixed strategy x.
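
The situation in Figure 3.1 can be reproduced in a few lines of Python. The sketch below is only an illustration for the planar case $m = 2$: the pentagon coordinates and the interior point are invented (they are not the coordinates of the figure), and brute-force enumeration of corner triples stands in for the polynomial-time method referred to in Theorem 3.1.1.

```python
from itertools import combinations


def barycentric(x, a, b, c):
    """Weights (wa, wb, wc) summing to 1 with wa*a + wb*b + wc*c = x in 2-D."""
    det = (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])
    if det == 0:
        raise ValueError("corners are collinear")
    wb = ((x[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (x[1] - a[1])) / det
    wc = ((b[0] - a[0]) * (x[1] - a[1]) - (x[0] - a[0]) * (b[1] - a[1])) / det
    return (1.0 - wb - wc, wb, wc)


def small_support_mixtures(x, corners, tol=1e-12):
    """All corner triples whose convex hull contains x, with their weights."""
    found = []
    for idx in combinations(range(len(corners)), 3):
        a, b, c = (corners[i] for i in idx)
        try:
            w = barycentric(x, a, b, c)
        except ValueError:
            continue  # degenerate triple, skip it
        if min(w) >= -tol:
            found.append((idx, w))
    return found


def reconstruct(idx, w, corners):
    """Recover the point sum_i w_i * corner_i from one mixture."""
    return tuple(sum(wi * corners[i][d] for i, wi in zip(idx, w))
                 for d in range(2))


# Hypothetical convex pentagon and interior point.
corners = [(0, 0), (4, 0), (5, 3), (2, 5), (-1, 3)]
x = (2, 2)
for idx, w in small_support_mixtures(x, corners):
    print(idx, [round(wi, 3) for wi in w], reconstruct(idx, w, corners))
```

Each printed triple is a different explicit mixed strategy with support size $m + 1 = 3$ that represents the same point, and any convex combination of two of them (viewed as distributions over all five corners) represents that point as well.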

In general, we can apply this approach for any game with a restricted set of primitive
actions, $X' \subseteq X$, as long as every point $x \in X$ can be represented as a convex combination
of points in $X'$, and we have an efficient algorithm to find such a representation. This
argument will extend to our generalized versions of stochastic and extensive-form games
as well, and so we do not worry about these two different interpretations of the action sets,
as from an optimization point of view they make no difference: in both cases, finding an
optimal implicit mixed strategy is sufficient.


3.1.2  Repeated  Convex  Games 

A convex game $(X, Y, M)$ can be played in a repeated-game setting (say by x) even if x
does not know $M$ or $Y$. On each round, she selects some $x \in X$, and simultaneously her
adversary y selects $y \in Y$ (but x does not know $Y$). Then x pays y the amount $x^\top M y$, where $M$ is also
unknown to x. We can play this game using online convex programming techniques and
achieve strong performance guarantees: player x will achieve at least the minimax value
of the game, and can potentially do much better against a sub-optimal adversary.

If after each round x observes the effective cost vector $My$, then the online linear
programming algorithm of Kalai and Vempala [2003] can be applied in the case of a
polyhedron $X$, or the online convex programming algorithms of Gordon [1999] or
Zinkevich [2003] can be used for general convex $X$. If player x only observes the amount of
the payoff, $x^\top M y$, then a bandit-style algorithm must be used: for example the algorithm
of McMahan and Blum [2004] for polyhedra (this algorithm is described in detail in
Chapter 6), or the algorithm of Flaxman et al. [2005] for general convex problems. Note that
for the performance guarantees of these algorithms to hold, some bounds on $Y$ and $M$ will
be needed: a bound on the one-round maximum payoff, $\sup_{x \in X,\, y \in Y} |x^\top M y|$, is usually
sufficient.

In fact, these algorithms can be used in the general-sum case, as there is no dependence
on player $y$’s incentives or payoffs. In this case $x$ is not guaranteed to approximately
achieve the value of any particular equilibrium, but will at least achieve her safety value,^
and can potentially do much better (say, if playing against a cooperative player $y$).
No-regret algorithms can also be used in an offline fashion to approximately solve for the
minimax equilibria of convex games, using techniques from Freund and Schapire [1996].
The idea is to run two no-regret algorithms against each other in the game, which often
yields algorithms similar to fictitious play. We will consider this relationship in more detail
in Section 5.1.
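
The self-play idea can be sketched on the simplest convex game, a matrix game in which both strategy sets are probability simplices. The sketch below is only an illustration under stated assumptions: it uses Hedge (multiplicative weights) as a stand-in no-regret algorithm, and the 2-by-2 matrix, the step size `eta`, and the helper names are hypothetical choices, not the algorithms developed later in this thesis. The time-averaged strategies approach a minimax pair because the exploitability gap is bounded by the sum of the two players' average regrets.

```python
import math

def softmax(scores):
    """Normalize exp(scores) into a probability vector (shifted for stability)."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def hedge_self_play(M, rounds, eta):
    """Run two Hedge learners against each other on the matrix game M.

    M[i][j] is what the row player x pays the column player y.
    Returns the time-averaged strategies (x_bar, y_bar).
    """
    m, n = len(M), len(M[0])
    loss_x = [0.0] * m   # cumulative cost of each row for x (the min player)
    gain_y = [0.0] * n   # cumulative payoff of each column for y (the max player)
    sum_x = [0.0] * m
    sum_y = [0.0] * n
    for _ in range(rounds):
        x = softmax([-eta * l for l in loss_x])
        y = softmax([eta * g for g in gain_y])
        # Full-information feedback: x sees the cost vector M y, y sees x^T M.
        cost = [sum(M[i][j] * y[j] for j in range(n)) for i in range(m)]
        gain = [sum(x[i] * M[i][j] for i in range(m)) for j in range(n)]
        loss_x = [a + b for a, b in zip(loss_x, cost)]
        gain_y = [a + b for a, b in zip(gain_y, gain)]
        sum_x = [a + b for a, b in zip(sum_x, x)]
        sum_y = [a + b for a, b in zip(sum_y, y)]
    return [s / rounds for s in sum_x], [s / rounds for s in sum_y]

def exploitability(M, x_bar, y_bar):
    """Gap between y's best response to x_bar and x's best response to y_bar."""
    m, n = len(M), len(M[0])
    worst_for_x = max(sum(x_bar[i] * M[i][j] for i in range(m)) for j in range(n))
    best_for_x = min(sum(M[i][j] * y_bar[j] for j in range(n)) for i in range(m))
    return worst_for_x - best_for_x

# Hypothetical game: minimax value 1/5, both players mix (2/5, 3/5).
M = [[2.0, -1.0], [-1.0, 1.0]]
x_bar, y_bar = hedge_self_play(M, rounds=20000, eta=0.005)
print(x_bar, y_bar, exploitability(M, x_bar, y_bar))
```

Averaging matters here: the per-round strategies of the two learners may cycle, while the averages carry the guarantee.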


The Importance of Convex Games Before proceeding to our examples of convex
games, it is worthwhile to review the importance of the class. Why is it worthwhile
to show that a game falls into this class? There are both theoretical and computational
advantages. By showing that a game has a convex game representation, we have also
shown:

•  A minimax solution exists, and it can be found by convex programming in
polynomial time.

•  There is a collection of fast algorithms that can be applied to the problem: for
example, fictitious play and the bundle-based oracle algorithms of Chapter 5.

•  There exist computationally efficient no-regret algorithms for the game.

^The safety value of the game for x is the value of the game when player y ignores his own payoffs and
instead tries to maximize player x’s loss.

We now proceed to the examples.


3.2  Extensive-form  Games 

In this section we review extensive-form games with the aim of connecting known results
to our notation and perspective. The formulation of EFGs commonly used today was
originally conceived by Kuhn [1953], who generalized an earlier formulation of von Neumann.

Two-player, zero-sum extensive-form games can model competitive strategic
interactions that involve a sequence of decisions and random events. The game is specified via a
game  tree,  where  at  each  node  either  one  of  the  players  selects  an  action  (corresponding 
to  a  successor  of  the  current  node)  or  nature  picks  a  random  successor  according  to  a  fixed 
probability  distribution.  Partial  observability  in  the  game  is  modeled  via  information  sets: 
an  information  set  is  a  subset  of  a  player’s  nodes  that  are  indistinguishable  to  the  player. 
That  is,  a  player’s  policy  is  only  allowed  to  be  a  function  of  his  observed  information  set, 
not  the  exact  node  in  the  game  tree;  necessarily,  all  nodes  in  an  information  set  must  have 
an  equal  number  of  successors. 

We  only  consider  games  with  perfect  recall,  which  ensures  each  player’s  information 
sets  form  a  tree.  This  implies  that  all  of  a  player’s  past  actions  and  observations  can  be 
inferred  given  the  current  information  set.  With  perfect  recall,  it  is  sufficient  to  consider 
only  behavior  policies,  that  is,  policies  which  specify  a  probability  distribution  over  actions 
at  each  information  set.  For  the  sequel,  when  we  write  extensive-form  game  (or  EFG) 
we  mean  a  two-player,  zero-sum,  perfect-recall  extensive-form  game  unless  otherwise 
specified. 

As an example, we consider the simple two-player two-card poker game shown in
Figure (3.2). In this game a dealer (the initial random node) gives each player a single
face-down card, either the ace or the king. Then, the players proceed to bet: first, player
x can either fold (losing her ante of $1), or bet an additional dollar. If she bets, player y
can either fold (losing his ante of $1), call (matching x’s bet), or bet (raise by matching x’s
dollar and adding another). In the case of a call, the game ends, and if y has the best hand
(a deal of (K,A)), then he wins $2 from x, otherwise he loses $2 to x. If he bets, then x
can either fold (losing $2), or call (adding another dollar to the pot to match y’s raise). If
x calls, then the player with the winning hand gets $3 from the other.

Figure (3.2) is a representation of this game as an extensive-form game. This
representation is not unique, though it is perhaps the most natural. The game tree has nodes
$V = \{r_1, x_1, x_2, y_1, y_2, x_3, x_4\}$ and three information sets. Player $x$’s set of information
sets is $\mathcal{U}_x = \{u_1, u_3\}$, where $u_1 = \{x_1, x_2\}$ and $u_3 = \{x_3, x_4\}$; player $y$’s set of
information sets is $\mathcal{U}_y = \{u_2\}$, where $u_2 = \{y_1, y_2\}$. Information sets are shown in the figure
by including the constituent nodes in a rounded rectangle. The set of actions (labels on
outgoing edges, also called choices) available to $x$ at $u_1$ is $A(u_1) = \{b, f\}$, and similarly
$A(u_2) = \{f', c', b'\}$ and $A(u_3) = \{F, C\}$. The first node $r_1$ is a random node, with a
fixed probability distribution $(0.5, 0.5)$ over the two possible deals. The leaves indicate
the payoffs from $x$ to $y$.

Figure 3.2: A simple poker game with a two card deck (Ace and King), represented as an
extensive-form game. [Figure not reproduced; the two deals are labeled (A, K) and (K, A).]

We  can,  in  general,  model  two-player  poker  games  in  this  way.  Each  node  in  the  game 
tree  encodes  a  fixed  setting  of  all  the  cards  dealt  so  far  as  well  as  the  betting  history.  But, 
in  general  there  will  be  some  cards  that  a  player  cannot  see.  At  a  point  in  the  game  where 
a  player  must  select  an  action  (usually  bet,  fold  or  call),  the  nodes  corresponding  to  the 
different  possible  settings  for  the  unobserved  cards  are  grouped  into  an  information  set. 
For  example,  in  our  very  simple  game  the  players  do  not  observe  either  of  the  cards,  and 
so  they  cannot  tell  which  of  the  two  possible  deals  occurred. 

The key results for extensive-form games that pertain to our work are the fact that
extensive-form games can be transformed to convex games, and that computing best
responses for an extensive-form game is very fast.

Figure 3.3: Sequence trees for the example poker game. (a) Player x’s Sequence Tree;
(b) Player y’s Sequence Tree.

First, we consider the representation of an EFG as a convex game. Since a behavior
strategy can be represented as the Cartesian product of probability simplices, the set of
such strategies is convex. Unfortunately, the payoff for a pair of strategies represented
this way is not bilinear (it cannot be written as $x^{\top} M y$), and so the EFG cannot be
written as a convex game using these strategy sets. Instead, we turn to the sequence weight
representation of strategies.

For a given EFG, we construct a convex game $(X, Y, M)$ where the strategy set $X$ has
one dimension for each possible sequence of (information set, action) pairs for player $x$
(and analogously for $Y$). For the example in Figure (3.2), the possible sequences $\sigma_i$ for
player $x$ and $\gamma_i$ for player $y$ are:

$$
\begin{array}{ll}
\sigma_0 = \emptyset & \gamma_0 = \emptyset \\
\sigma_1 = (f) & \gamma_1 = (f') \\
\sigma_2 = (b) & \gamma_2 = (b') \\
\sigma_3 = (b, F) & \gamma_3 = (c') \\
\sigma_4 = (b, C). &
\end{array}
$$

We assume each possible action label is associated with a single information set, so we can
omit the information sets from the sequences: for example $(b, C)$ uniquely corresponds to
$(u_1, b, u_3, C)$. Perfect recall ensures sequences are in one-to-one correspondence with the
action labels, so in fact each sequence is uniquely identified by its last action. Thus, we
can use action labels and sequences interchangeably. The perfect recall assumption also
means that the information sets of each player (and hence, the sequences) form a tree. That
is, any non-empty sequence has a unique predecessor sequence.

The  sequence  trees  for  the  example  EFG  of  Figure  (3.2)  are  given  in  Figure  (3.3). 
The  information  sets  are  shown  as  large  rounded  rectangles,  and  each  edge  out  of  an 
information  set  corresponds  to  a  particular  action  label  and  sequence.  The  small  round 
node  indicates  a  junction  where  the  next  information  set  reached  is  determined  by  choices 
of  the  adversary  and  nature  (shown  as  a  dotted  line);  this  node  is  trivial  in  the  example 
tree as it has a single successor. For an arbitrary EFG, junction nodes and information-set
nodes alternate and junction nodes may have many successors.

We now define a strategy representation that admits a bilinear objective function. The
strategy polyhedron $X$ has one dimension for each sequence for $x$; elements of $X$ are
called sequence weight vectors. The sequence weight $x_i$ associated with the sequence
$\sigma_i$ can be interpreted as the probability that the sequence $\sigma_i$ occurs under the policy $x$
represents, conditioned on the other player and nature deterministically taking actions
compatible with this sequence.

There is a natural mapping between behavior policies $\beta$ for player $x$ and
sequence-weight vectors $x \in X$. If $\beta$ is a behavior policy, the corresponding $x$ has weight $x_i$ on
sequence $\sigma_i$ equal to the product of the probabilities that $\beta$ places on each action in $\sigma_i$. As an
example, suppose $\beta$ is the policy for $x$ that folds 1/3 of the time at $u_1$, and always calls at
$u_3$. The corresponding sequence-weight vector is

$$x = \left(1, \tfrac{1}{3}, \tfrac{2}{3}, 0, \tfrac{2}{3}\right).$$

For any sequence-weight vector $x \in X$, we define an associated behavior policy $\beta_x$.
For example, given the vector $x$ above, the probability with which $\beta_x$ bets at $u_1$ is

$$\beta_x(u_1, b) = \frac{x_2}{x_1 + x_2} = \frac{2/3}{1/3 + 2/3} = \frac{2}{3}$$

(note the zero indexing of $x$). Similarly, the probability $\beta_x$ calls at $u_3$ is $x_4 / (x_3 + x_4)$, and
so on. In general $\beta_x(u, a)$ is defined by

$$\beta_x(u, a) = \frac{x_a}{\sum_{a' \in A(u)} x_{a'}}.$$

The policy $\beta_x$ will be undefined at information sets it never reaches, but this is fine. For
formal proofs that these properties hold consult Koller and Megiddo [1992].
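
The two mappings can be sketched for player x in the example game as follows. This is a minimal illustration, not code from the thesis: the `SEQUENCES` table and the function names are hypothetical, and exact fractions are used so the numbers match the worked example (fold 1/3 of the time at u1, always call at u3).

```python
from fractions import Fraction as Fr

# Player x's non-empty sequences in the example game, indexed as in the text:
# 1: (f), 2: (b), 3: (b, F), 4: (b, C); index 0 is the empty sequence.
# Each entry: (predecessor sequence, information set, last action).
SEQUENCES = {
    1: (0, "u1", "f"),
    2: (0, "u1", "b"),
    3: (2, "u3", "F"),
    4: (2, "u3", "C"),
}

def to_sequence_weights(beta):
    """Map a behavior policy beta[(u, a)] to a sequence-weight vector x."""
    x = [Fr(1)] + [Fr(0)] * len(SEQUENCES)
    for i in sorted(SEQUENCES):            # predecessors have smaller indices
        parent, u, a = SEQUENCES[i]
        x[i] = x[parent] * beta[(u, a)]
    return x

def to_behavior(x):
    """Map a sequence-weight vector x back to a behavior policy.

    Information sets that x never reaches are left out (the policy is
    undefined there).
    """
    beta = {}
    for i, (_, u, a) in SEQUENCES.items():
        reach = sum(x[j] for j, (_, u2, _) in SEQUENCES.items() if u2 == u)
        if reach > 0:
            beta[(u, a)] = x[i] / reach
    return beta

beta = {("u1", "f"): Fr(1, 3), ("u1", "b"): Fr(2, 3),
        ("u3", "F"): Fr(0), ("u3", "C"): Fr(1)}
x = to_sequence_weights(beta)
print(x)                # weights on sequences 0..4
print(to_behavior(x))   # recovers beta
```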

The set $X$ of all sequence weight vectors that correspond to valid behavior policies is
a polyhedron. For the example game, writing $\sigma_i$ for the weight $x_i$ placed on sequence $\sigma_i$,
the constraints are:

$$
\begin{aligned}
\sigma_0 &= 1 \\
\sigma_0 &= \sigma_1 + \sigma_2 \\
\sigma_2 &= \sigma_3 + \sigma_4 \\
\sigma_i &\ge 0 \quad \forall i \in \{0, \dots, 4\}.
\end{aligned}
$$

This technique generalizes to all extensive-form games, and $X$ can always be represented
as a polyhedron with one equality constraint for each information set along with
non-negativity constraints [Koller and Megiddo, 1992].
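
As a small check of these constraints, the sketch below writes the example polyhedron in the form E x = e, x >= 0 and tests membership. The matrix layout and the function name are illustrative assumptions; the normalization row for the empty sequence is listed alongside the one row per information set.

```python
from fractions import Fraction as Fr

# Equality constraints E x = e for player x in the example game:
# the empty sequence has weight 1, then one row per information set.
E = [
    [1,  0,  0,  0,  0],   # x_0 = 1
    [1, -1, -1,  0,  0],   # u1: x_0 = x_1 + x_2
    [0,  0,  1, -1, -1],   # u3: x_2 = x_3 + x_4
]
e = [1, 0, 0]

def in_strategy_polyhedron(x):
    """Check non-negativity and E x = e exactly."""
    if any(v < 0 for v in x):
        return False
    return all(sum(a * v for a, v in zip(row, x)) == rhs
               for row, rhs in zip(E, e))

valid = [Fr(1), Fr(1, 3), Fr(2, 3), Fr(0), Fr(2, 3)]
invalid = [Fr(1), Fr(1, 2), Fr(1, 2), Fr(1, 2), Fr(1, 2)]   # violates the u3 row
print(in_strategy_polyhedron(valid), in_strategy_polyhedron(invalid))
```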

The payoff matrix $M$ corresponding to the sequence-weight strategy polyhedra $X$ and
$Y$ has one entry for each pair of sequences $(\sigma, \gamma)$. Let $L$ be the set of leaves for the
extensive-form game, and let $m : L \to \mathbb{R}$ give the payoff at each leaf. For $\ell \in L$, define
$I(\ell, \sigma, \gamma)$ to be 1 when the set of action labels on the path to $\ell$ exactly equals the set of
action labels in $\sigma \cup \gamma$, and 0 otherwise.^ Let $\mathrm{rand}(\ell)$ be the product of the probabilities on
all random edges on the path to $\ell$. Then, the value $M(\sigma, \gamma)$ is

$$M(\sigma, \gamma) = \sum_{\ell \in L} I(\ell, \sigma, \gamma)\, \mathrm{rand}(\ell)\, m(\ell).$$

Two sequences $\sigma$ and $\gamma$ are inconsistent if no path in the game tree has exactly those
actions: for example, $\sigma_1$ and $\gamma_1$ are inconsistent because it is impossible for both players to
fold. For such a sequence pair, $M(\sigma, \gamma) = 0$, and so in general $M$ may have considerable
sparsity. In order to show that the convex game $(X, Y, M)$ is equivalent to the EFG, one
must show for all $x \in X$ and $y \in Y$ that

$$V(\beta_x, \beta_y) = x^{\top} M y.$$

We  do  not  give  a  formal  proof  here;  however,  in  the  next  chapter  we  give  the  corresponding 
proof  for  convex  extensive-form  games. 
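
For the example poker game the identity can be checked numerically. The sketch below is an illustration under assumptions: the leaf table is transcribed from the payoffs stated in the description of the game, the data layout and function names are hypothetical, and the two leaves that differ only in the deal are stored together because neither player's policy can depend on the deal.

```python
from fractions import Fraction as Fr
from math import prod

X_SEQ = {(): 0, ("f",): 1, ("b",): 2, ("b", "F"): 3, ("b", "C"): 4}
Y_SEQ = {(): 0, ("f'",): 1, ("b'",): 2, ("c'",): 3}
DEAL_PROB = [Fr(1, 2), Fr(1, 2)]          # deals (A, K) and (K, A)

# Each entry: (x's actions on the path, y's actions on the path,
# payoff from x to y under deal (A, K), payoff under deal (K, A)).
LEAVES = [
    (("f",), (), 1, 1),              # x folds, losing her ante
    (("b",), ("f'",), -1, -1),       # y folds, losing his ante
    (("b",), ("c'",), -2, 2),        # y calls; the better hand wins $2
    (("b", "F"), ("b'",), 2, 2),     # y raises, x folds, losing $2
    (("b", "C"), ("b'",), -3, 3),    # y raises, x calls; the better hand wins $3
]

def payoff_matrix():
    """M[sigma][gamma] = sum over consistent leaves of rand(leaf) * m(leaf)."""
    M = [[Fr(0)] * len(Y_SEQ) for _ in X_SEQ]
    for xs, ys, pay_ak, pay_ka in LEAVES:
        M[X_SEQ[xs]][Y_SEQ[ys]] += DEAL_PROB[0] * pay_ak + DEAL_PROB[1] * pay_ka
    return M

def tree_value(bx, by):
    """Expected payoff from x to y, summing directly over leaves of the tree."""
    total = Fr(0)
    for xs, ys, pay_ak, pay_ka in LEAVES:
        reach = prod(bx[a] for a in xs) * prod(by[a] for a in ys)
        total += reach * (DEAL_PROB[0] * pay_ak + DEAL_PROB[1] * pay_ka)
    return total

def weights(behavior, seq_index):
    """Sequence weights: product of behavior probabilities along each sequence."""
    w = [Fr(0)] * len(seq_index)
    for seq, i in seq_index.items():
        w[i] = prod((behavior[a] for a in seq), start=Fr(1))
    return w

bx = {"f": Fr(1, 3), "b": Fr(2, 3), "F": Fr(0), "C": Fr(1)}
by = {"f'": Fr(1, 4), "b'": Fr(1, 4), "c'": Fr(1, 2)}
M = payoff_matrix()
x, y = weights(bx, X_SEQ), weights(by, Y_SEQ)
bilinear = sum(x[i] * M[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))
print(tree_value(bx, by), bilinear)
```

Only three of the twenty entries of M are nonzero in this game: the two showdown entries cancel in expectation over the deals, which illustrates the sparsity mentioned above.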

^The assumption that the action labels from different information sets are distinct is critical to this
definition.

Fast best-response algorithms for extensive-form games use dynamic programming on
the response player’s sequence tree (Figure (3.3)). If we fix an opponent strategy $y$, the
vector $c = My$ assigns a cost $c(\sigma)$ to each player $x$ sequence or, equivalently, action.
A value function on information sets can then be computed via dynamic programming,
working from the leaves backward to the root. For player $x$ (the min player), at each leaf
information set a greedy action with respect to the costs $c(\sigma)$ on the leaf sequences is
selected, and the corresponding value is assigned to the information set. Internal junction
nodes are given value equal to the sum of the values of their successor information-set
nodes. Internal information-set nodes have value equal to the minimum over action labels
$a$ of the immediate cost $c(\sigma a)$ of the sequence ending in $a$ plus the value of the successor
junction node reached after $a$. Any behavior policy that is greedy with respect to these
values is a best response to $y$. Computing these values and reading off a best response takes
time $O(m)$, since $m$ equals the number of edges in the tree. Note that computing $c = My$ is
much more expensive, in general $O(mn)$. For a more detailed treatment of the best-response
problem in extensive-form games, see [Koller and Megiddo, 1992]; for the transformation of
extensive-form games to convex games see [Koller et al., 1994].
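
The dynamic program can be sketched for player x in the example game. This is a minimal illustration under assumptions: the `TREE` encoding, the function name, and the opponent strategy are hypothetical, and the payoff matrix entries are the ones implied by the payoffs stated for the example game (rows are x's sequences 0..4, columns are y's sequences in the order empty, f', b', c').

```python
from fractions import Fraction as Fr

# Player x's sequence tree. Each information set maps an action to
# (index of the sequence ending in that action, information sets that can
# follow it through the junction node).
TREE = {
    "u1": {"f": (1, []), "b": (2, ["u3"])},
    "u3": {"F": (3, []), "C": (4, [])},
}
ROOTS = ["u1"]   # information sets reachable after the empty sequence

def best_response(c):
    """Return (value, greedy action per information set) for cost vector c."""
    policy = {}

    def value(u):
        best_action, best_value = None, None
        for a, (seq, successors) in TREE[u].items():
            # Immediate cost of the sequence plus the junction value, which is
            # the sum of the values of the successor information sets.
            v = c[seq] + sum(value(s) for s in successors)
            if best_value is None or v < best_value:
                best_action, best_value = a, v
        policy[u] = best_action
        return best_value

    return c[0] + sum(value(u) for u in ROOTS), policy

M = [[0, 0, 0, 0], [1, 0, 0, 0], [0, -1, 0, 0], [0, 0, 2, 0], [0, 0, 0, 0]]
y = [Fr(1), Fr(1, 4), Fr(1, 4), Fr(1, 2)]
c = [sum(m * w for m, w in zip(row, y)) for row in M]   # c = M y, O(mn) in general
print(best_response(c))
```

Each information set is visited once, so the pass over the tree is linear in the number of sequences; forming c is the expensive step, as noted above.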


3.3  Optimal  Oblivious  Routing 

A significant amount of work has been done on the problem of computing an oblivious
routing for a graph, in both the exact and approximate cases [see Azar et al., 2004,
Bienkowski et al., 2003, and references therein]. However, the observation that optimal
oblivious routing can be expressed as a convex game is new to this thesis.

Expressing the problem as a convex game provides access to a large array of theoretical
results and efficient algorithms. For example, this representation immediately shows that
there is a polynomial-sized linear program for optimal oblivious routing. This fact did not
appear in the literature until [Applegate and Cohen, 2003]. Further, as Chapter 5 of this
thesis will demonstrate, efficient algorithms exist for convex games that can be much faster
than standard LP codes. While the algorithm of Azar et al. [2003] is impractical for large
real-world problems, the application of our fast convex game solvers to this problem might
make generating optimal solutions to very large oblivious routing problems possible.

The polyhedral convex game representation opens up new possibilities for the online
(repeated-play) version of the problem as well. For example, the algorithm of Bansal et al.
[2003] requires a call to a projection oracle on each iteration in order to perform gradient
ascent. Finding such a projection requires solving a semi-definite program. However,
the polyhedral convex game representation implies that the online algorithm of Kalai and
Vempala [2002] can be used to get similar bounds, while requiring only the solution of a
linear program^ on each iteration. The convex game representation also shows that many
variations on the standard problem can be solved in polynomial time, and automates the
process of producing compact LP representations for these variants.

^In fact, the LP represents a standard multi-commodity flow problem.

We state the problem using the terminology of Azar et al. [2003]. The optimal
oblivious routing game is specified by a directed graph $G = (V, E)$ with edge capacities
$c : E \to \mathbb{R}_+$. The routing player selects a routing for traffic in $G$, represented by a
one-unit flow $f_{ij}$ from each vertex $i \in V$ to each vertex $j \neq i$. The term oblivious refers to
the fact that the demands are not known at the time this routing is computed. The variable
$f_{ij}(e)$ specifies the volume of flow on edge $e$ associated with routing 1 unit of volume from
$i$ to $j$ using $f_{ij}$. Thus, routing demand $d_{ij}$ from $i$ to $j$ using $f_{ij}$ results in volume $d_{ij} f_{ij}(e)$
on each edge $e$. The set $F$ of all valid routings is convex and specified by a number of
constraints polynomial in the size of the input graph.

Given a set of demands $d = \{d_{ij} \in \mathbb{R} \mid i, j \in V,\ i \neq j,\ d_{ij} \ge 0\}$, the congestion on an
edge $e$ under routing $f$ is given by

$$\mathrm{econg}(e, f, d) = \frac{\sum_{i \neq j} d_{ij}\, f_{ij}(e)}{c(e)}.$$

The congestion of the routing is then

$$\mathrm{cong}(f, d) = \max_{e \in E} \mathrm{econg}(e, f, d).$$

For a fixed set of demands $d$, there exists a minimal congestion routing, with congestion

$$\mathrm{opt}(d) = \min_{f \in F} \mathrm{cong}(f, d),$$

which can be found via linear programming.
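
The edge and routing congestion definitions can be illustrated on a tiny instance. Everything in this sketch is a hypothetical example rather than data from the thesis: the three-edge graph, the routing, the demands, and the dictionary layout are assumptions, and pairs with no listed demand are treated as having zero demand. Computing opt(d) would additionally require a linear program and is not attempted here.

```python
from fractions import Fraction as Fr

# Hypothetical three-edge graph with capacities c(e).
capacity = {("s", "t"): Fr(2), ("s", "a"): Fr(1), ("a", "t"): Fr(1)}

# Routing f: for each (source, sink) pair, the fraction of one unit of flow
# placed on each edge.
routing = {
    ("s", "t"): {("s", "t"): Fr(1, 2), ("s", "a"): Fr(1, 2), ("a", "t"): Fr(1, 2)},
    ("s", "a"): {("s", "a"): Fr(1)},
}
demand = {("s", "t"): Fr(2), ("s", "a"): Fr(1)}

def econg(e, f, d):
    """Congestion on edge e: total routed volume divided by capacity."""
    volume = sum(d[pair] * f[pair].get(e, Fr(0)) for pair in d)
    return volume / capacity[e]

def cong(f, d):
    """Congestion of the routing: the most congested edge."""
    return max(econg(e, f, d) for e in capacity)

for edge in capacity:
    print(edge, econg(edge, routing, demand))
print(cong(routing, demand))
```

Scaling every demand by the same positive constant scales `cong` by that constant, which is the observation used below to restrict the adversary to demands with opt(d) at most 1.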


If $|V| = n$, then there are $n(n - 1)$ flows that define $f$, and each one has an associated
demand. Let $D = \{d \in \mathbb{R}^{n(n-1)} \mid d \ge 0\}$ be the set of all possible demands. The
adversary in the optimal oblivious routing game selects a demand $d \in D$, and the overall
game is then

$$\min_{f \in F} \max_{d \in D} \frac{\mathrm{cong}(f, d)}{\mathrm{opt}(d)}.$$

While $F$ and $D$ are convex sets, the objective function is not bilinear. Instead, we
reformulate the objective as follows. Scaling $d$ by a positive constant does not change the
objective, as it scales both the numerator and denominator equally. Thus, it is equivalent
to optimize over the set $D_1 = \{d \in D \mid \mathrm{opt}(d) \le 1\}$. Using this observation and the
definition of cong, we can rewrite the game as

$$\min_{f \in F} \max_{d \in D_1} \max_{e \in E} \mathrm{econg}(e, f, d).$$

Now we have a game with a multi-linear objective function (because the capacities $c(e)$
are a constant given by the problem specification). The set $D_1$ is in fact a polyhedron,
as shown by Equations (8) and (9) of [Azar et al., 2003]. The key idea is to think of
the adversary as picking an arbitrary multi-commodity flow with congestion at most 1;
the corresponding demands $d$ are a linear function of the multi-commodity flow. But, the
objective is not bilinear if we think of the same player (the adversary) choosing both $d$ and
$e$. Since we can imagine the $f \in F$ as fixed when the adversary chooses his action, we
can allow the adversary to pick an arbitrary distribution $\mu \in \Delta(E)$, giving the game

$$\min_{f \in F} \max_{\mu \in \Delta(E),\, d \in D_1} \sum_{e \in E} \mu(e)\, \mathrm{econg}(e, f, d). \tag{3.9}$$

If $\mu \in \Delta(E)$ was a fixed parameter of the problem, for example, if we wished to minimize
the average edge congestion rather than the maximum, then we would be done. However,
the objective as given above is non-linear for the max player: to form a convex game, we
need the value to be a linear function of the adversary’s strategy for a fixed $f$.

We sketch a way that this can be accomplished; the details of the technique are
actually closely related to the convex extensive-form game formulation introduced in the next
section. Let $\hat{D}_1 = \{(\alpha d, \alpha) \mid d \in D_1,\ \alpha \ge 0\}$. We call this set the cone extension of $D_1$;
since $D_1$ is a polyhedron, so is $\hat{D}_1$ (see Appendix B). We define the adversary’s strategy
polyhedron $D^{\mu}$ by the variables $d^{\mu}_{ij}(e)$ and $\mu(e)$ via the following constraints:

$$
\begin{aligned}
\mu &\in \Delta(E) \\
(d^{\mu}(e), \alpha(e)) &\in \hat{D}_1 && \forall e \in E \\
\mu(e) &= \alpha(e) && \forall e \in E.
\end{aligned}
$$

It can then be shown that the optimal oblivious routing game (Equation 3.9) is equivalent
to the convex game:

$$\min_{f \in F} \max_{(d^{\mu}, \mu) \in D^{\mu}} \sum_{e \in E} \sum_{i \neq j} \frac{d^{\mu}_{ij}(e)\, f_{ij}(e)}{c(e)}. \tag{3.10}$$


The equivalence follows from the fact that for all $e$, $d^{\mu}_{ij}(e) = \mu(e)\, d_{ij}(e)$ for some demands
$d(e) \in D_1$. We are thus allowing the max player to pick a different set of demands for
each edge, but since the max is achieved at a single edge this does not change the value
of the game. For a fixed $f$, the best response problem is exactly the problem solved by
the separation oracle of Azar et al. [2003], which solves an independent problem for each
edge $e \in E$. It is worth noting that the linear program due to Applegate and Cohen
[2003] is different from the one obtained by applying Equation (3.4) to the convex game
of Equation (3.10). The relative merits of the two different LP formulations have not yet
been investigated.




The convex game representation of Equation (3.10) implies that many variations on
the basic problem are also solvable in polynomial time. For example, we can introduce
additional constraints on $\mu$, replacing $\Delta(E)$ with an arbitrary convex subset of $\Delta(E)$.
Similarly, we could further constrain $D_1$ to only allow the adversary to pick demands that
are convex combinations of demands that have been observed in the past. Since these
transformations preserve the convexity of the adversary’s strategy set, the game remains
convex.


3.4  MDPs  with  Adversary-controlled  Costs 


We investigate methods for planning in a Markov Decision Process where the cost function
is chosen by an adversary after a policy for the MDP has been chosen by the planning
player. First we consider the case where the opponent is restricted to a finite set of cost
functions, and then we consider the case of an arbitrary convex set of cost vectors.^ The
latter situation includes games where the cost function in player one’s MDP is a linear
function of the state-action frequency representation of the policy chosen by player two
in another MDP. This work originally appeared in [McMahan et al., 2003, McMahan and
Gordon, 2003]. This section provides all the necessary background to read Section (5.2);
the reader with immediate interest in algorithmic approaches should feel free to consult
that section after completing the present one.

As a running example, we consider a robot path planning problem where costs are
influenced by sensors that an adversary places in the environment. We formulate the
problem as a zero-sum matrix game where rows correspond to deterministic policies for the
planning player and columns correspond to cost vectors the adversary can select. This
exponentially large matrix game has a concise representation as a convex game; we explore
that representation and other details of the problem formulation in this section. For a fixed
cost vector, fast algorithms (such as value iteration) are available for solving MDPs. In
Chapter 5, we develop algorithms that use these fast best response oracles, and show that
for our path planning problem they can be several orders of magnitude faster than direct
solution of the linear programming formulation.


^Since we consider only finite state and action spaces, we use the terms cost function and cost vector
interchangeably.

3.4.1  Introduction  and  Motivation


Imagine a robot in a known (previously mapped) environment which must navigate to
a goal location. We wish to choose a path for the robot that will avoid detection by an
adversary. This adversary has some number of fixed sensors (perhaps surveillance cameras
or stationary robots) that he will position in order to detect our robot. These sensors are
undetectable by our robot, so it cannot discover their locations and change its behavior
accordingly. What path should the robot follow to minimize the time it is visible to the
sensors? Or, from the opponent’s point of view, what are the optimal locations for the
sensors?

We assume that we know the sensors’ capabilities. That is, given a sensor position we
can calculate what portion of the world it can observe. So, if we know where the
opponent has placed sensors, we can compute a cost vector for the MDP: each entry assigns a
constant observation cost to each world state observed by a sensor. We also add a small
movement cost, so that the robot prefers shorter paths to the goal. Given this fixed cost
vector, we can apply efficient planning algorithms (value iteration in stochastic
environments, A* search in deterministic environments) to find a path for the robot that minimizes
the total observation time. Of course, in the full game we don’t know the sensor locations;
instead we have a set of possible cost vectors, one for each allowable sensor
configuration, and we must minimize the expected cost under the worst-case distribution over cost
vectors. In this section, we discuss different ways to model the general problem we have
described, and discuss the variation we solve. In particular, we show that the problem can
be formulated as a convex game.

Our algorithms are practical for problems of realistic size, and we have used our
implementation to find plans for robots playing laser tag as part of a larger project [Rosencrantz
et al., 2003]. Figure (3.4) shows the optimal solutions for both players for a particular
instance of the problem. The map is of Rangos Hall at Carnegie Mellon University, with
obstacles corresponding to overturned tables and boxes placed to create an interesting
environment for laser tag experiments. The optimal strategy for the planner is a distribution
over paths from the start (s) to one of the goals (g), shown in Figure (3.4)A; this corresponds
to a mixed strategy in the matrix game, that is, a distribution over the rows of the game
matrix. The optimal strategy for the opponent is a distribution over sensor placements, or
equivalently a distribution over the columns of the game matrix. This figure is discussed
in detail in Section (3.4.4).

3.4.2  Model  Formulation


There are a number of ways we could model our planning problem. The model we choose,
which we call the no observation, single play formulation, corresponds to the assumptions
outlined above. Initially, we restrict the opponent to choosing a cost vector from a finite
though possibly large set, but later we relax this to allow arbitrary convex sets of
possible costs. The planning agent knows this set as well as the dynamics of the MDP, and
so constructs a policy that optimizes worst-case expected cost given these allowed cost
vectors. Let $\Pi_D$ be the set of proper deterministic policies available to the planning agent,
let $K = \{c_1, \dots, c_k\}$ be the set of cost vectors available to the adversary, and let $V(\pi, c)$
be the value^ of policy $\pi \in \Pi_D$ under cost vector $c \in K$. The goal is to solve the matrix
game with one row for each $\pi \in \Pi_D$ and one column for each $c \in K$; the entry in the
payoff matrix for row $\pi$ and column $c$ is then $V(\pi, c)$. Equivalently, we wish to solve the
optimization

$$\min_{p \in \Delta(\Pi_D)} \max_{q \in \Delta(K)} \mathbb{E}_{\pi \sim p,\, c \sim q}\left[V(\pi, c)\right], \tag{3.11}$$

along with the distributions $p$ and $q$ that achieve this value. We now discuss the
assumptions behind this formulation of the problem in more detail, and examine several other
possible formulations.
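
A toy instance makes the matrix game concrete. The sketch below is purely illustrative: the two paths, the two sensor placements, and the cost values (movement cost 1, observation cost 10) are hypothetical, and the environment is deterministic so that V is a simple sum of state costs along a path. It evaluates the inner maximization of (3.11) for a given mixed strategy p; because the expectation is linear in q, that maximum is attained at a single column.

```python
from fractions import Fraction as Fr

# Hypothetical instance: two deterministic policies (paths to the goal g)
# and two cost vectors (one per sensor placement).
PATHS = {"via_a": ["a", "g"], "via_b": ["b", "g"]}
COSTS = {
    "watch_a": {"a": 11, "b": 1, "g": 1},   # movement cost 1, observation cost 10
    "watch_b": {"a": 1, "b": 11, "g": 1},
}

def V(path, cost):
    """Total cost of following a path under a fixed cost vector."""
    return sum(cost[state] for state in path)

# Payoff matrix: one row per policy, one column per cost vector.
MATRIX = {pi: {c: V(PATHS[pi], COSTS[c]) for c in COSTS} for pi in PATHS}

def worst_case(p):
    """Inner max of (3.11) for the row distribution p."""
    return max(sum(p[pi] * MATRIX[pi][c] for pi in PATHS) for c in COSTS)

pure = {"via_a": Fr(1), "via_b": Fr(0)}
mixed = {"via_a": Fr(1, 2), "via_b": Fr(1, 2)}
print(worst_case(pure), worst_case(mixed))
```

In this instance committing to one path costs 12 in the worst case, while mixing evenly over the two paths costs 7 against either placement, which is why the planner's solution is in general a distribution over policies.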

Our most limiting assumption is that our planning agent cannot observe the
adversary’s effect on the cost vector. In our example domain, the robot incurs fixed, observable
costs  for  moving,  running  into  objects,  etc.;  however,  it  cannot  determine  when  it  is  being 
watched  and  so  it  cannot  determine  the  cost  vector  selected  by  the  adversary.  This  is  a 
reasonable  assumption  for  some  domains,  but  not  others.  If  the  assumption  does  not  hold, 
our  algorithms  will  produce  suboptimal  policies:  for  example,  we  would  not  be  able  to 
plan  to  check  whether  a  path  was  being  watched  before  following  it. 

The no-observation assumption, while sometimes unrealistic, is what allows us to
develop efficient algorithms. Without this assumption, the planning problem in general
becomes a partially-observable Markov decision process even when we know the distribution
over cost vectors the opponent has chosen: the unknown cost vector is the hidden state and
the costs incurred are observations. POMDPs are known to be difficult to even
approximately solve [Kaelbling et al., 1996]; on the other hand, the planning problem without
observations admits polynomial-time algorithms, as we will show. Later, we will use a
generalization of extensive-form games (introduced in Chapter 4) to relax the no-observation
assumption somewhat. We will be able to model problems where the planning player can
make some limited observations (perhaps only from a constant number of states, or
only at a constant number of fixed times) and still maintain computational tractability.

^That is, the expected value of the start state under policy $\pi$. This can be found by solving a set of linear
equations; see Section 2.2.

Figure 3.4: Planning in a robot laser tag environment. Part A: A mixture of optimal
trajectories for a robot traveling from start location (s) to one of three goals (g). The
opponent can put a sensor in one of 4 locations (x), facing one of 8 directions. The widths
of the trajectories correspond to the probability that the robot takes the given path. Parts
B, C, D, E: The optimal opponent strategy randomizes among the sensor placements that
produce these four fields of view.

In addition to the POMDP formulation, our problem can also be framed in an online
setting where the MDP must be solved multiple times for different cost vectors. The
planning agent must pick a policy for the $n$th game based on the cost vectors it has seen in
the first $n - 1$ games. The goal is to do well in total cost, compared to the best fixed policy
against the opponent's sequence of selected cost vectors. To obtain tractable algorithms we
still make the no-observation assumption, but it is not necessary to assume the opponent
chooses cost vectors from a finite or convex set. When this formulation is applied to
shortest path problems on graphs, it is the online shortest path problem for which some
efficient algorithms are already known [Takimoto and Warmuth, 2002]. Once we have
shown the transformation to an equivalent convex game, the algorithms for the repeated
setting discussed at the end of Section 3.1 immediately apply.

It is worth noting the relationship between our problem and stochastic games. Our
setting  is  more  general  in  some  ways  and  less  general  in  others:  we  allow  hidden  state  (the 
cost  function),  but  stochastic  games  allow  players  to  make  a  sequence  of  interdependent 
moves  while  we  require  both  players  to  select  their  policies  simultaneously  at  the  outset. 
Our  work  also  differs  from  that  of  Bagnell  et  al.  [2001],  in  that  they  consider  uncertainty 
about  the  dynamics  model,  while  we  consider  uncertainty  about  the  cost  function. 

In general, the no-observation assumption is applicable in two cases: when observations
are actually impossible, and when observations are possible, but once they have been
made  there  is  nothing  to  be  done.  The  way  we  initially  phrased  our  robot  path-planning 
problem,  it  falls  in  the  first  case:  the  sensors  cannot  be  detected.  On  the  other  hand,  if  we 
can  detect  a  sensor  but  have  already  lost  the  game  once  we  detect  it,  the  problem  falls  in 
the  second  case. 

So far we have imagined an adversary selecting one cost vector from a set of cost
vectors; however, our formulation also applies to the case where the actual cost is given by the
highest cost of the chosen policy with respect to any of the cost vectors. For example,
suppose  there  is  a  competition  to  control  a  robot  performing  an  industrial  welding  task.  In 
the  first  round  the  robots  will  be  evaluated  by  three  human  judges,  each  of  which  has  the 
ability  to  remove  a  robot  from  consideration.  It  is  known  that  one  judge  will  prefer  faster 
robots,  another  will  be  more  concerned  with  the  robots’  power  consumption,  and  another 
with  the  precision  with  which  the  task  is  performed.  If  the  task  is  formulated  as  an  MDP, 
then  each  judge’s  preference  can  be  turned  into  a  cost  vector,  and  our  algorithm  will  find 
the  policy  that  maximizes  the  lowest  score  given  by  any  of  the  three  judges.  The  policy 
calculated  will  be  optimal  if  all  three  judges  evaluate  the  policy  and  then  assign  the  lowest 
of  their  scores,  or  if  an  adversary  picks  a  distribution  from  which  only  one  judge  will  be 
chosen. 

We  now  proceed  with  some  background  on  MDPs  and  linear  programming,  and  then 
present  the  transformation  to  a  convex  game. 


3.4.3  Solving  MDPs  with  Linear  Programming 

In  this  section,  we  review  techniques  for  solving  MDPs  via  linear  programming,  as  this 
background  will  lead  directly  to  a  linear  programming  formulation  of  the  adversarial  MDP 
model  as  well  as  provide  the  tools  for  transforming  the  problem  into  an  equivalent  convex 
game. 

Consider an MDP $M$ with a state set $S$ and action set $A$. The dynamics for the MDP
are specified for all $s, s' \in S$ and $a \in A$ by $P^a_{ss'}$, the probability of moving to state $s'$
if action $a$ is taken from state $s$. In order to express problems regarding MDPs as linear
programs, it is useful to define a matrix $E$ as follows: $E$ has one row for every state-action
pair and one column per state. The entry for row $(s, a)$ and column $s'$ contains $P^a_{ss'}$ for
all $s \neq s'$, and $P^a_{ss} - 1$ for $s = s'$. A cost function for the MDP can be represented as
a vector $c$ that contains one entry for each state-action pair $(s, a)$ indicating the cost of
taking action $a$ in state $s$. A stochastic policy for an MDP is a mapping $\pi : S \times A \to [0, 1]$,
so that $\pi(s, a)$ gives the probability an agent will take action $a$ in state $s$. Thus, for all $s$
we must have $\sum_{a \in A} \pi(s, a) = 1$. A deterministic policy is one that puts all its probability
on a single action for each state, so that it can be represented by $\pi : S \to A$. The
Markov assumption implies that we do not need to consider history-dependent policies;
the policies we consider are stationary, in that they depend only on the current state. For
an MDP with a fixed cost function $c$ there is always an optimal deterministic policy, and
so stochastic policies play a lesser role. In our adversarial formulation, however, optimal
policies are typically stochastic.
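
As a small illustration of the matrix $E$ defined above, the following sketch builds it for a toy MDP. The two-state MDP, its action names, and the helper `build_E` are hypothetical and chosen only for this example; the zero-cost absorbing goal is assumed to be left out of the state set, so a row of transition probabilities may sum to less than one.

```python
# Toy MDP (hypothetical): states 0 and 1 are non-goal states. The zero-cost
# absorbing goal is left implicit, so a row of P may sum to less than 1.
STATES = [0, 1]
ACTIONS = ["a", "b"]

# P[(s, a)][s2] is the probability of moving from s to s2 under action a.
P = {
    (0, "a"): {0: 0.0, 1: 1.0},  # always move to state 1
    (0, "b"): {0: 0.5, 1: 0.0},  # stay with prob. 0.5, otherwise reach the goal
    (1, "a"): {0: 0.0, 1: 0.0},  # reach the goal immediately
    (1, "b"): {0: 0.5, 1: 0.5},  # wander between the two states
}

def build_E(states, actions, P):
    """Return the row labels (s, a) and the matrix E as a list of rows.

    Entry E[(s, a)][s2] is P^a_{s s2} for s2 != s and P^a_{ss} - 1 for s2 == s.
    """
    rows = [(s, a) for s in states for a in actions]
    E = [[P[(s, a)][s2] - (1.0 if s2 == s else 0.0) for s2 in states]
         for (s, a) in rows]
    return rows, E

rows, E = build_E(STATES, ACTIONS, P)
for label, row in zip(rows, E):
    print(label, row)
```

Each row of $E$ describes the net change in expected state occupancy caused by taking one state-action pair once, which is why the same matrix appears in both linear programs below.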

We are primarily concerned with undiscounted shortest path optimality: that is, all
states have at least one finite-length path to a zero-cost absorbing state, and so undiscounted
costs can be used. Our results can be adapted to discounted infinite horizon problems
by multiplying all the probabilities $P^a_{ss'}$ by a discount factor $\gamma$ when the matrix $E$ is
formed. The results can also be extended to an average reward model, but this requires
slightly more complicated changes to the linear programs introduced below.

There are two natural representations of a policy for an MDP, one in terms of frequencies
and another in terms of total costs or values. Each arises naturally from a different
linear programming formulation of the MDP problem. For any policy $\pi$ we can compute
a value function, $v_\pi : S \to \mathbb{R}$, that associates with each state the total cost $v_\pi(s)$ that an
agent will incur if it follows $\pi$ from $s$ for the rest of time. If $\pi$ is optimal then the policy
achieved by acting greedily with respect to $v_\pi$ is optimal. Thus, value functions can represent
deterministic greedy policies, but not arbitrary stochastic policies; hence, to find an
optimal policy for our adversarial problem we will need a different policy representation.

The optimal value function for an MDP with cost vector $c$ and fixed start state distribution
$\mu_s \in \Delta(S)$ can be found by solving the following linear program:

$$\begin{aligned} \max_{v} \quad & \mu_s \cdot v \\ \text{subject to} \quad & Ev + c \geq 0. \end{aligned} \tag{3.12}$$

Footnote (on history-dependent policies): We assume the standard definition of the history,
where it contains only states and actions. If costs incurred appear in the history then our
formulation does not apply, as we are in the POMDP case.

The set of constraints $Ev + c \geq 0$ is equivalent to the statement that

$$v(s) \leq c(s, a) + \sum_{s' \in S} P^a_{ss'}\, v(s')$$

for all $s \in S$ and $a \in A$, corresponding to the Bellman equations (see Section 2.2).
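
To make the constraint $Ev + c \geq 0$ concrete, here is a minimal sketch that computes the optimal value function of a toy shortest-path MDP and then checks the constraints numerically. It uses value iteration rather than an LP solver, which is a substitution made only to keep the example dependency-free; the MDP, its costs, and the function names are hypothetical.

```python
# Toy shortest-path MDP (hypothetical). Each entry maps (state, action) to
# (cost, {next_state: probability}); missing probability mass reaches the
# zero-cost absorbing goal, which is left out of STATES.
STATES = [0, 1]
MDP = {
    (0, "a"): (1.0, {1: 1.0}),
    (0, "b"): (4.0, {0: 0.5}),
    (1, "a"): (2.0, {}),
    (1, "b"): (1.0, {0: 0.5, 1: 0.5}),
}

def backup(v, s, a):
    """Right-hand side of the Bellman inequality: c(s,a) + sum_s' P v(s')."""
    cost, trans = MDP[(s, a)]
    return cost + sum(p * v[s2] for s2, p in trans.items())

def value_iteration(sweeps=200):
    """Value iteration, used here in place of solving LP (3.12) directly."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {s: min(backup(v, s, a) for (s1, a) in MDP if s1 == s)
             for s in STATES}
    return v

v = value_iteration()

# One row of (Ev + c) per state-action pair: backup(v, s, a) - v(s).
slack = {(s, a): backup(v, s, a) - v[s] for (s, a) in MDP}
print(v)
print(slack)
```

Every slack is non-negative, and in each state at least one action has zero slack; those tight rows identify the greedy (optimal) actions.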

Fixing an arbitrary stochastic policy $\pi$ and start state distribution $\mu_s$ uniquely determines
a set of state-action visitation frequencies $f \in \mathbb{R}^{|S||A|}$, where $f(s, a)$ gives the expected
number of times action $a$ is taken from state $s$ before the goal is reached, given that the
initial state is drawn from $\mu_s$ and the agent follows $\pi$. We write $f^\pi$ when we wish to
show the dependence on $\pi$; the dependence on $\mu_s$ is implicit. The dual of (3.12) is the
linear program whose feasible region is the set of state-action visitation frequency vectors
(that correspond to some stochastic policy), and is given by

$$\begin{aligned} \min_{f} \quad & f \cdot c \\ \text{subject to} \quad & E^T f + \mu_s = 0 \\ & f \geq 0. \end{aligned} \tag{3.13}$$

The constraints $E^T f + \mu_s = 0$ require that the sum of all the frequencies into a state $x$
(counting the start probability $\mu_s(x)$ as inflow) equal the sum of all the frequencies out of $x$.
The objective $f \cdot c$ represents the expected value of the start state drawn from $\mu_s$ under the
policy $\pi$ which corresponds to $f$. For any cost vector $c$ we can compute the value of $\pi$ as

$$V(\pi, c) = f^\pi \cdot c. \tag{3.14}$$
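
The relationship between a stochastic policy, its visitation frequencies, and Equation (3.14) can be checked numerically. The sketch below is an illustration under assumed data: the toy MDP, the policy, and the helper names are hypothetical, and the frequencies are obtained by propagating state occupancy forward for a fixed number of steps rather than by solving a linear system.

```python
# Toy shortest-path MDP (hypothetical): (state, action) -> (cost, transitions).
# Probability mass not listed in the transitions reaches the absorbing goal.
STATES = [0, 1]
MDP = {
    (0, "a"): (1.0, {1: 1.0}),
    (0, "b"): (4.0, {0: 0.5}),
    (1, "a"): (2.0, {}),
    (1, "b"): (1.0, {0: 0.5, 1: 0.5}),
}
POLICY = {(0, "a"): 0.5, (0, "b"): 0.5, (1, "a"): 1.0, (1, "b"): 0.0}
MU = {0: 1.0, 1: 0.0}  # start state distribution

def visitation_frequencies(policy, mu, steps=500):
    """Accumulate expected state-action visit counts by forward propagation."""
    f = {sa: 0.0 for sa in MDP}
    d = dict(mu)  # occupancy of non-goal states at the current time step
    for _ in range(steps):
        nxt = {s: 0.0 for s in STATES}
        for (s, a), (_, trans) in MDP.items():
            w = d[s] * policy[(s, a)]
            f[(s, a)] += w
            for s2, p in trans.items():
                nxt[s2] += w * p
        d = nxt
    return f

def flow_residual(f, mu):
    """Per-state value of (E^T f + mu): inflow - outflow + start mass."""
    res = {}
    for s2 in STATES:
        inflow = sum(f[(s, a)] * trans.get(s2, 0.0)
                     for (s, a), (_, trans) in MDP.items())
        outflow = sum(f[(s, a)] for (s, a) in MDP if s == s2)
        res[s2] = inflow - outflow + mu[s2]
    return res

f = visitation_frequencies(POLICY, MU)
policy_cost = sum(f[sa] * MDP[sa][0] for sa in MDP)  # f . c, Equation (3.14)
residual = flow_residual(f, MU)
print(f, policy_cost, residual)
```

For this policy the expected total cost from the start state can also be derived by hand from the Bellman equations, and it agrees with $f \cdot c$.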


3.4.4  Representation  as  a  Convex  Game 

Our game will have the convex strategy set

$$F = \{\, f \in \mathbb{R}^{|S||A|} \mid E^T f + \mu_s = 0,\ f \geq 0 \,\} \tag{3.15}$$

for the planning player. As mentioned in the previous section, there is a correspondence
between the set $F$ and the set of stochastic policies. In fact, this correspondence is very
similar to the correspondence between behavioral policies and sequence-weight vectors in
extensive-form games. As in that case, a state-action visitation vector $f$ will not specify the
distribution over actions to be taken at states never reached by the corresponding policy.

It can be shown that $\mathrm{Cn}(F) = \{ f^\pi \mid \pi \in \Pi_D \}$, that is, the corners of $F$ are the
deterministic policies. Thus, a mixed strategy $p \in \Delta(\Pi_D)$ for the row player in the matrix game
of Equation (3.11) corresponds exactly to an explicit mixed strategy over $F$, and hence
is equivalent to the implicit mixed strategy $f = \sum_{\pi \in \Pi_D} p(\pi)\, f^\pi$. In other words, every
stochastic policy (when represented as a state-action visitation frequency vector) can be
represented as a convex combination of deterministic policies, and every convex combination
of deterministic policies corresponds to some stochastic policy. Puterman [1994,
Sec. 6.9] gives a detailed proof of this fact; it can also be proved as a consequence of
Theorem (3.1.1).

We denote the convex hull of a finite set such as $K$ by

$$H(K) = \Big\{ \sum_{c \in K} q(c)\, c \;\Big|\; q \in \Delta(K) \Big\}.$$

A mixed strategy $q \in \Delta(K)$ for the column player is equivalent in expectation to the
cost vector $\sum_{c \in K} q(c)\, c$ in the convex set $H(K)$. Given a state-action frequency vector
$f \in F$ and an implicit mixed cost vector $c \in H(K)$, the value of the game is given by
$c \cdot f$ by Equation (3.14). Thus, we can reduce our exponentially large matrix game (given
by Equation (3.11)) to the convex game $(F, H(K), I)$ where $I$ is the identity matrix. We
can also define the equivalent convex game $(F, \Delta(K), M_K)$, where $M_K$ is the matrix with
columns $c_1, \ldots, c_{|K|}$. The two representations of the convex game correspond to writing the
objective function as $f^T I (M_K q)$ versus $f^T M_K q$.
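
The equivalence between explicit mixed strategies in the matrix game and implicit mixed strategies in the convex game rests on bilinearity, which the following sketch checks on a tiny instance. The two deterministic "paths", the two cost vectors, and the mixing weights are hypothetical values chosen for illustration.

```python
# Three state-action pairs (edges). With deterministic dynamics, each
# deterministic policy is a path, written as a 0/1 visitation vector f^pi.
F_DET = {"left": [1.0, 0.0, 1.0], "right": [0.0, 1.0, 1.0]}
COSTS = [[1.0, 5.0, 1.0], [5.0, 1.0, 1.0]]  # cost vectors c_1, c_2 in K

p = {"left": 0.25, "right": 0.75}  # explicit mixed strategy over policies
q = [0.5, 0.5]                     # explicit mixed strategy over cost vectors

def dot(u, w):
    return sum(a * b for a, b in zip(u, w))

# Expected payoff of the matrix game, Equation (3.11) for fixed p and q.
explicit_value = sum(p[name] * q[k] * dot(F_DET[name], COSTS[k])
                     for name in F_DET for k in range(len(COSTS)))

# Implicit mixed strategies: f in F and c in H(K).
f = [sum(p[name] * F_DET[name][i] for name in F_DET) for i in range(3)]
c = [sum(q[k] * COSTS[k][i] for k in range(len(COSTS))) for i in range(3)]
implicit_value = dot(f, c)  # f^T M_K q

print(explicit_value, implicit_value)
```

Both computations give the same number, so the planner only needs to reason about the polynomial-sized vector $f$ rather than a distribution over exponentially many deterministic policies.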

While we could conclude this section here, it is instructive to directly construct the
linear programs for the convex game $(F, \Delta(K), M_K)$. We do so by extending (3.13) with
another variable $z$ which represents the maximum cost of the policy $f$ over all possible
opponent cost vectors:

$$\begin{aligned} \min_{z, f} \quad & z \\ \text{subject to} \quad & E^T f + \mu_s = 0 \\ & -\mathbf{1} z + M_K^T f \leq 0 \\ & f \geq 0. \end{aligned} \tag{3.16}$$

The primal variables $f$ of (3.16) give an optimal implicit mixed strategy for the planning
player. Taking the dual of (3.16), we have

$$\begin{aligned} \max_{v, q} \quad & v \cdot \mu_s \\ \text{subject to} \quad & Ev + M_K q \geq 0 \\ & \mathbf{1} \cdot q = 1 \\ & q \geq 0, \end{aligned} \tag{3.17}$$

where $q$ gives the optimal mixture of costs for the adversary, and $v$ is the value function
when playing against this distribution. The value function $v$ induces a deterministic policy
that gives a best response if the opponent chooses the distribution $q$ over cost vectors, but
in general this pair of strategies will not be a minimax equilibrium. However, the pair
$(f, q)$ will be. It is straightforward to verify that Equation (3.16) is a special case of
Equation (3.4) and that Equation (3.17) is a special case of Equation (3.5).
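
One way to see how (3.16) and (3.17) fit together is to check a candidate solution pair against both sets of constraints: if both are feasible and the two objectives agree, weak LP duality certifies optimality. The sketch below does this for a hand-built instance; the three-edge MDP, the cost vectors, and the candidate solutions are hypothetical and were worked out by hand for this example, not produced by an LP solver.

```python
# Toy instance (hypothetical). State 0 is the start; actions L and R both lead
# to state 1, and action G leads from state 1 to the goal. Rows of E and
# entries of f are ordered (0,L), (0,R), (1,G); columns are states 0 and 1.
E = [[-1.0, 1.0], [-1.0, 1.0], [0.0, -1.0]]
MU = [1.0, 0.0]
M_K = [[1.0, 5.0], [5.0, 1.0], [1.0, 1.0]]  # columns are cost vectors c_1, c_2

def primal_feasible(z, f, tol=1e-9):
    """Constraints of (3.16): flow balance, z bounds every cost, f >= 0."""
    flow = [sum(E[r][s] * f[r] for r in range(3)) + MU[s] for s in range(2)]
    costs = [sum(M_K[r][k] * f[r] for r in range(3)) for k in range(2)]
    return (all(abs(x) <= tol for x in flow)
            and all(ck - z <= tol for ck in costs)
            and all(x >= -tol for x in f))

def dual_feasible(v, q, tol=1e-9):
    """Constraints of (3.17): Ev + M_K q >= 0, q a probability vector."""
    mixed = [sum(M_K[r][k] * q[k] for k in range(2)) for r in range(3)]
    slack = [sum(E[r][s] * v[s] for s in range(2)) + mixed[r] for r in range(3)]
    return (all(x >= -tol for x in slack)
            and abs(sum(q) - 1.0) <= tol
            and all(x >= -tol for x in q))

# Candidate solutions: split evenly between the two paths, and between costs.
z, f = 4.0, [0.5, 0.5, 1.0]
v, q = [4.0, 1.0], [0.5, 0.5]

primal_ok = primal_feasible(z, f)
dual_ok = dual_feasible(v, q)
dual_objective = sum(vi * mi for vi, mi in zip(v, MU))
print(primal_ok, dual_ok, z, dual_objective)
```

Because the primal objective $z$ equals the dual objective $v \cdot \mu_s$, the pair $(f, q)$ is a minimax equilibrium of this small game.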

Figure 3.4 shows a solution to the robot path planning problem formulated in this
way. The left portion, A, shows the minimax optimal strategy for the planner. The sample
problem has deterministic dynamics, so a deterministic policy from $\Pi_D$ is simply a start-to-goal
path (note there are 3 goal states in the example domain). The optimal stochastic policy
is shown as a distribution over deterministic paths; the width of the path line indicates
the relative probability with which it is selected. Each of the four right-hand panels (B, C,
D, and E) corresponds to a cost vector from $K$, shown as the field of view of the sensor
placement. These four are the most likely cost vectors selected by the opponent's minimax
optimal policy; they are chosen with probability 0.18, 0.42, 0.11, and 0.28 respectively.
The remainder of the probability mass is on other sensor placements.

Unfortunately, many interesting MDPs are too large to allow efficient solution via linear
programming, and so neither of the above linear programs may be practical; however,
for a fixed cost function, value iteration or other MDP algorithms can solve such large
problems. In Chapter 5 we develop techniques that allow us to use an arbitrary MDP
solution technique as a best-response oracle in an iterative algorithm for solving (3.16).


3.4.5  Cost-paired  MDP  Games 

In this section we describe a generalization of the adversarial-cost MDPs of the previous
section. In cost-paired MDP games, both players select a policy in a separate MDP, but
the costs associated with a policy in one of the MDPs depend (via a linear function) on
the policy selected by the other player. These cost-paired MDP games represent an interesting
and computationally tractable class of adversarial planning problems; they can be
formulated as polynomial-sized convex games.


Revisiting  the  Sensor-placement  Game 

In the previous section we considered a fixed, finite set of possible sensor configurations
that determined costs; the techniques introduced in this section let us consider a mobile
sensor platform that must decide on an observation strategy represented as a policy in an
MDP. The observer's rewards depend on its own policy as well as on the motion of the
entity  which  it  is  trying  to  observe.  Suppose  that  the  output  from  the  sensor  cannot  be 
processed  in  real  time  due  to  latency,  insufficient  on-board  computation,  or  the  need  for 
human  expert  analysis;  suppose  also  that  the  entity  being  observed  is  aware  that  it  may  be 
observed,  but  cannot  detect  when  observations  happen. 

One  natural  instance  of  this  problem  is  scientific  data  collection  from  a  satellite  or 
planetary  rover.  We  want  to  maximize  the  amount  of  time  that  the  sensor  spends  observing 
a  particular  natural  phenomenon.  Communication  delays  prevent  the  sensor  from  altering 
its  actions  based  on  the  data  collected  so  far.  Nature  is  oblivious  to  the  sensor’s  actions, 
but  we  treat  her  as  an  adversary  in  order  to  compute  a  robust  plan.  We  need  not  model 
nature  as  purely  adversarial:  to  the  extent  we  have  good  estimates  of  the  probabilities  that 
govern  the  behavior  of  nature,  we  can  embed  these  into  nature’s  MDP.  In  this  way  the 
only  degrees  of  freedom  we  leave  the  adversary  correspond  to  uncertainty  for  which  we 
have  no  good  statistical  model.  Note  that  if  we  play  this  game  multiple  times,  then  we  can 
use  online  learning  (see  Chapter  6)  to  capitalize  on  the  fact  that  nature  may  not  be  purely 
adversarial  even  if  we  lack  any  probabilistic  model. 


Problem  Model 

We have a two-player, zero-sum game, with players $x$ and $y$ as usual. Let $M_x =
(S^x, A^x, P^x, \mu_s^x)$ and $M_y = (S^y, A^y, P^y, \mu_s^y)$ be MDPs, one for each player. For each
MDP, $S$ is a finite set of states, $A$ is a finite set of actions, $P : (S \times A) \to \Delta(S)$ is a
transition function, and $\mu_s$ is a distribution over start states. Each MDP would normally
have a vector of state-action costs, but we leave the costs unspecified for now; costs in
$M_x$ will depend on the policy $y$ chooses for $M_y$, and vice versa. Let $m = |S^x||A^x|$
and $n = |S^y||A^y|$. Let $\Pi^x_D$ ($\Pi^x_S$) be the set of deterministic (stochastic) policies for $M_x$,
and define $\Pi^y_D$ and $\Pi^y_S$ analogously for $M_y$. We rule out policies with infinite visitation
frequencies; we can do so either by introducing a discount factor (in which case all discounted
frequencies will be finite) or by assuming positive edge costs for $X$, negative costs
for $Y$, and no "orphan" states (in which case the agents will never choose nonterminating
policies). As we observed in the previous section, the set $\Pi^x_S$ can be represented via a
convex set $X$ of state-action visitation frequencies using Equation (3.15), and similarly
$\Pi^y_S$ can be represented by $Y$. These will be our strategy sets for the convex game; it
remains to define the payoffs.

The cost vector for $M_x$ will be a linear function of $y$'s state-action visitation frequencies
$y \in Y$, and vice versa. In particular, we define the value of a pair of policies $x \in X$
and $y \in Y$ as

$$V(x, y) = c_x \cdot x + x^T G y - c_y \cdot y.$$

Here $c_x$ and $c_y$ are fixed cost vectors for $X$ and $Y$, while the matrix $G$ governs the interaction
between the two players; the fixed costs may account for movement costs or other
costs in the game that are independent of the other player's policy. Since we interpret $x$
as the min player and $y$ as the max player, $y$'s fixed cost $c_y \cdot y$ gets a negative sign. To
represent $V$ as a bilinear form $x^T M y$ we can add a new dimension with a fixed value of 1
to $X$ and $Y$, and then add $c_x$ ($-c_y$) as a new column (row) of $G$.
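
The augmentation that folds the fixed costs into a single bilinear form can be sketched as follows. The numbers are arbitrary, the names `c_x` and `c_y` for the fixed cost vectors are the labels used in this illustration, and the helper functions are hypothetical; the new row carries $-c_y$ so that the sign convention of $V(x, y)$ is preserved.

```python
def dot(u, w):
    return sum(a * b for a, b in zip(u, w))

def bilinear(x, M, y):
    """Compute x^T M y for a matrix M stored as a list of rows."""
    return sum(x[i] * dot(M[i], y) for i in range(len(x)))

def value(x, y, G, c_x, c_y):
    """V(x, y) = c_x . x + x^T G y - c_y . y."""
    return dot(c_x, x) + bilinear(x, G, y) - dot(c_y, y)

def augment(G, c_x, c_y):
    """Build M so that (x, 1)^T M (y, 1) equals V(x, y)."""
    M = [row + [c_x[i]] for i, row in enumerate(G)]  # c_x as a new column
    M.append([-c for c in c_y] + [0.0])              # -c_y as a new row
    return M

# Arbitrary illustrative data.
G = [[1.0, 2.0], [3.0, 4.0]]
c_x = [1.0, 2.0]
c_y = [3.0, 4.0]
x = [0.5, 1.5]
y = [2.0, 1.0]

M = augment(G, c_x, c_y)
direct = value(x, y, G, c_x, c_y)
via_bilinear = bilinear(x + [1.0], M, y + [1.0])
print(direct, via_bilinear)
```

The corner entry of $M$ is zero because $V$ has no constant term; a nonzero value there would add a fixed offset to every payoff.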

The matrix $G$ has a row for each state-action pair $(s, a)$ in $M_x$ and similarly a column
for each state-action pair $(t, b)$ in $M_y$. We can interpret the entry $G((s, a), (t, b))$ as the
cost associated with the product of the number of times that $x$ took action $a$ from $s$ and $y$
took action $b$ from state $t$. In this way we have a convex game,

$$\min_{x \in X} \; \max_{y \in Y} \; V(x, y), \tag{3.18}$$

equivalent to the cost-paired MDP game.


Modeling costs in the mobile-sensor game. We might model a mobile sensor placement/avoidance
problem in the following way: both MDPs $M_x$ and $M_y$ have the same
state space $S$; time is explicitly encoded in the state space, so each $s \in S$ corresponds to
being at a particular location $\mathrm{loc}(s)$ at a particular time $\mathrm{time}(s)$. Player $y$ is the sensing
player; when he is in state $t$ he can observe all of the states in $\mathrm{obs}(t) \subseteq S$. In particular,
$s \in \mathrm{obs}(t)$ if and only if $\mathrm{time}(s) = \mathrm{time}(t)$ and $\mathrm{loc}(s)$ is visible from $\mathrm{loc}(t)$. Then for each
state $t$, we define

$$\forall s \in \mathrm{obs}(t), \quad G((s, a), (t, b)) = z \tag{3.19}$$

for all $a \in A^x$ and all $b \in A^y$, where $z \in \mathbb{R}$ is the cost associated with $y$ observing $x$ for a
single timestep.

Consider state-action visitation frequencies $x \in X$ and $y \in Y$. Then let $x_s =
\sum_{a \in A^x} x(s, a)$ and $y_t = \sum_{b \in A^y} y(t, b)$. Time is explicitly encoded in the state space and
increases after each action, so no state can be reached more than once. Thus, we can interpret
$x_s$ as the probability that $x$ is at location $\mathrm{loc}(s)$ at time $\mathrm{time}(s)$, and similarly for
$y_t$. If $s \in \mathrm{obs}(t)$, then $x_s y_t$ is the probability that $y$ observes $x$ at location $\mathrm{loc}(s)$ at time
$\mathrm{time}(s) = \mathrm{time}(t)$ from location $\mathrm{loc}(t)$; our definition of $G$ in Equation (3.19) ensures that
$x^T M y$ (when multiplied out) contains the term $x_s y_t z$ to account for this expected cost.
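
The construction of $G$ from Equation (3.19) and the resulting expected observation cost can be sketched on a tiny world. Everything concrete here is an assumption made for illustration: locations lie on a line, a location is visible when it is at distance at most 1, states are `(location, time)` pairs, entries of $G$ not covered by (3.19) are taken to be zero, and the visitation frequencies are hand-built.

```python
Z = 10.0  # hypothetical cost to x of being observed for one timestep

def visible(loc_s, loc_t):
    # Assumption for this toy world: y sees locations at distance at most 1.
    return abs(loc_s - loc_t) <= 1

def G_entry(s, t):
    """G((s, a), (t, b)) per Equation (3.19); independent of a and b.

    Entries with s outside obs(t) are assumed to be zero.
    """
    (loc_s, time_s), (loc_t, time_t) = s, t
    return Z if time_s == time_t and visible(loc_s, loc_t) else 0.0

# Visitation frequencies keyed by ((location, time), action).
# x starts at location 0 and either stays or moves right at time 0.
x = {((0, 0), "stay"): 0.5, ((0, 0), "move"): 0.5,
     ((0, 1), "end"): 0.5, ((1, 1), "end"): 0.5}
# y sits at location 2 for both timesteps.
y = {((2, 0), "stay"): 1.0, ((2, 1), "end"): 1.0}

def interaction_cost(x, y):
    """x^T G y, multiplied out over all pairs of state-action entries."""
    return sum(xf * yf * G_entry(s, t)
               for (s, _a), xf in x.items()
               for (t, _b), yf in y.items())

def marginal(freq):
    """Sum frequencies over actions to get x_s (or y_t)."""
    out = {}
    for (s, _action), w in freq.items():
        out[s] = out.get(s, 0.0) + w
    return out

x_s, y_t = marginal(x), marginal(y)
expected_cost = sum(x_s[s] * y_t[t] * G_entry(s, t) for s in x_s for t in y_t)
print(interaction_cost(x, y), expected_cost)
```

Player $x$ is only within view at time 1, and only if it moved, which happens with probability 0.5; the interaction term therefore equals half of the per-timestep observation cost.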

3.5 Convex Stochastic Games


Stochastic games (SGs, also called Markov games) generalize MDPs to multiple players
by putting a matrix game, called a stage game, at each state. The game is fully observable,
and on each round the players play the stage game by simultaneously selecting a row or
column. The immediate payoff from one player to the other and the distribution over the
next state in the MDP are both functions of the pair of actions chosen in this way. There
is a large body of literature on stochastic games: Neyman and Sorin [2003] and Owen
[1995] both offer a good general starting point, while Bowling and Veloso [2000] provide
an introduction from a reinforcement learning point of view. Littman [1994] also shows the
usefulness of stochastic games as a model for multi-agent learning. Partially observable
stochastic games (POSGs) are much more expressive but less tractable than stochastic
games. Recent research has shown the usefulness of POSGs; see Emery-Montemerlo et al.
[2004] and Hansen et al. [2004] for a variety of applications.

In this section we introduce Convex Stochastic Games (CSGs), stochastic games with
convex games in place of the usual matrix stage games. This allows us to embed extensive-form
games (transformed to convex games) as the stage games, yielding a tractable class
of partially observable stochastic games.

We consider the zero-sum case played by $x$ and $y$. The convex stochastic game is
played on a set $S$ of states; each $s \in S$ is associated with a convex stage-game. The
convex game at $s$ has action sets $X_s \subseteq \mathbb{R}^{m_s}$ and $Y_s \subseteq \mathbb{R}^{n_s}$ for $x$ and $y$. Next-state
transition probabilities are defined via non-negative matrices $F^{ss'}$ for every pair
of states $s, s' \in S$. The probability that the game transitions from $s$ to $s'$ given that $x$
played $x \in X_s$ and $y$ played $y \in Y_s$ is then defined to be $x^T F^{ss'} y$, so we require

$$\forall s \in S,\ \forall x \in X_s,\ \forall y \in Y_s, \quad \sum_{s' \in S} x^T F^{ss'} y = 1$$

and

$$\forall s, s' \in S,\ \forall x \in X_s,\ \forall y \in Y_s, \quad x^T F^{ss'} y \geq 0.$$

Payoffs are specified via a matrix $M^s$, so when $x$ plays $x$ and $y$ plays $y$, the payoff from $x$
to $y$ is $x^T M^s y$. It is straightforward to verify that CSGs generalize stochastic games. The
mapping from matrix stage-games to convex game stage-games is given exactly by the
transformation described in Section 3.1. Suppose $R_s$ and $C_s$ are the row and column
strategies from the original matrix game at $s$. Then $X_s = \Delta(R_s)$ and $Y_s = \Delta(C_s)$, the
probability simplices explicitly representing the sets of possible mixed strategies for each
stage game. The stochastic game's transitions are specified by probabilities $\Pr(s' \mid s, i, j)$
for each $s, s' \in S$ and $i \in R_s$, $j \in C_s$. We construct the transition matrix by letting
$F^{ss'}_{ij} = \Pr(s' \mid s, i, j)$. Then for mixed strategies $x \in X_s$ and $y \in Y_s$,

$$\Pr(s' \mid s, x, y) = x^T F^{ss'} y.$$
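
The mapping from an ordinary stochastic game to the transition matrices of a CSG can be checked in a few lines. The sketch below uses a hypothetical two-state game with a 2x2 stage game at state 0; the probabilities and helper names are made up for this example.

```python
# PR[(s, i, j)] maps each next state s2 to Pr(s2 | s, i, j) for the stage game
# at state s = 0, with row action i and column action j (hypothetical values).
PR = {
    (0, 0, 0): {0: 1.0},
    (0, 0, 1): {1: 1.0},
    (0, 1, 0): {0: 0.5, 1: 0.5},
    (0, 1, 1): {0: 0.25, 1: 0.75},
}
STATES = [0, 1]

def transition_matrix(pr, s, s2, n_rows=2, n_cols=2):
    """Matrix F^{s s2} with entries F_ij = Pr(s2 | s, i, j)."""
    return [[pr[(s, i, j)].get(s2, 0.0) for j in range(n_cols)]
            for i in range(n_rows)]

def bilinear(x, F, y):
    """Compute x^T F y."""
    return sum(x[i] * F[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))

# Mixed strategies for the two players at state 0.
x = [0.5, 0.5]
y = [0.25, 0.75]

next_state_probs = {s2: bilinear(x, transition_matrix(PR, 0, s2), y)
                    for s2 in STATES}
print(next_state_probs)
```

Because each `PR[(s, i, j)]` is a probability distribution and `x`, `y` lie in simplices, the bilinear forms are non-negative and sum to one over next states, which is exactly the pair of requirements stated above.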

Given a discount factor $\gamma \in (0, 1)$ on future payoffs, the convex stochastic game can be
solved via minimax value iteration (this is a straightforward extension to Littman [1994]).

In minimax value iteration for SGs, game values are calculated by solving a stage-game
modified to take into account estimated future payoffs. The convergence of the algorithm
follows because backups are a contraction given $\gamma < 1$ [see Owen, 1995, for a proof].
These same results carry over to CSGs. Let $v \in \mathbb{R}^{|S|}$ be the current value function estimate.
To back up at $s$, we solve the convex game at $s$ with the modified objective function

$$x^T M^s y + \gamma \sum_{s' \in S} \left( x^T F^{ss'} y \right) v(s'). \tag{3.20}$$

This objective includes the immediate payoff $x^T M^s y$ and the discounted term $\gamma\, \mathbb{E}[v(s') \mid x, y]$,
which estimates the expected cost of the rest of the game given that $x$ and $y$ are played.
Note that Equation (3.20) can be rewritten as

$$x^T \Big( M^s + \gamma \sum_{s' \in S} v(s')\, F^{ss'} \Big) y,$$

and so it is a bilinear function of $x$ and $y$. Thus, we can implement the backup operator
for minimax value iteration by solving the modified stage games via convex programming,
and so solve discounted CSGs.
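
A minimal sketch of the backup in Equation (3.20) follows, restricted to the special case where every stage game is a 2x2 matrix game so that it can be solved in closed form instead of by convex programming. The game data, the discount factor, and the function names are hypothetical; the row player minimizes and the column player maximizes, matching the convention that payoffs flow from $x$ to $y$.

```python
GAMMA = 0.5
STATES = [0, 1]

# Immediate payoff matrices M^s (hypothetical values).
M = {0: [[2.0, 0.0], [0.0, 1.0]],
     1: [[1.0, 1.0], [1.0, 1.0]]}

# Transition matrices F[(s, s2)][i][j] = Pr(s2 | s, i, j). From state 0 the
# game stays in state 0 only under joint action (0, 0); state 1 is absorbing.
F = {(0, 0): [[1.0, 0.0], [0.0, 0.0]],
     (0, 1): [[0.0, 1.0], [1.0, 1.0]],
     (1, 0): [[0.0, 0.0], [0.0, 0.0]],
     (1, 1): [[1.0, 1.0], [1.0, 1.0]]}

def solve_2x2(A):
    """Minimax value of a 2x2 zero-sum game; rows minimize, columns maximize."""
    upper = min(max(row) for row in A)                    # best pure row
    lower = max(min(A[0][j], A[1][j]) for j in range(2))  # best pure column
    if upper - lower < 1e-12:  # pure-strategy saddle point
        return upper
    (a, b), (c, d) = A
    return (a * d - b * c) / (a + d - b - c)  # fully mixed equilibrium value

def modified_stage_game(s, v):
    """Matrix M^s + gamma * sum_s2 v(s2) F^{s s2}, the rewritten form of (3.20)."""
    return [[M[s][i][j] + GAMMA * sum(F[(s, s2)][i][j] * v[s2] for s2 in STATES)
             for j in range(2)] for i in range(2)]

def minimax_value_iteration(sweeps=100):
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {s: solve_2x2(modified_stage_game(s, v)) for s in STATES}
    return v

v = minimax_value_iteration()
print(v)
```

For this instance the fixed point can be found by hand: state 1 has value 2, and substituting into the mixed-strategy formula at state 0 gives the quadratic $v^2 + 2v - 6 = 0$, whose positive root is $\sqrt{7} - 1$. For general convex stage games the call to `solve_2x2` would be replaced by a convex-game solver.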

Solving a class of POSGs. Using the convex game transformation reviewed in this chapter,
we can embed extensive-form games as the stage games in convex stochastic games.
We can view this overall structure as an EFG with loops, and can "unroll" this embedding
with the following interpretation: each EFG stage-game corresponds to a subgame (both
players know which subgame is being played). These subgames have partial observability,
but after a subgame completes a fully-observable transition is made to another subgame.
However, all "back edges" must be to nodes that begin subgames. This game is a fairly
general POSG: it has partial observability and states can repeat. It can be solved in polynomial
time using the techniques introduced in this section. The key is that the periods of
partial observability are of bounded duration (equal to the height of one of the embedded
EFGs); the solution time is polynomial in the representation of the game, but possibly
exponential in this horizon time.

For some interesting applications, assuming that only short periods of partial observability
occur between periods of full observability is reasonable; for example, we could
plan how to handle a temporary failure of a lighting system, GPS localization, or other
sensors. We could also model a two-player poker tournament where each stage game corresponds
to playing a game of poker with a fixed (fully observable) initial number of chips
for each player. The result of each poker game produces a fully observable transition to
another game (where the number of starting chips for each player depends on the outcome
of the last game), or the end of the tournament (say, if one player runs out of chips).

However,  if  the  partial  observability  in  some  domain  is  due  to  an  adversary  that  may  go 
unobserved  for  long  periods  of  time,  this  approach  will  not  produce  tractable  games.  After 
we  introduce  CEFGs  in  the  next  chapter,  we  will  argue  that  more  realistic  problems  can 
be  modeled  by  embedding  CEFGs  as  stage  games  in  a  convex  game,  because  they  provide 
a  powerful  method  for  treating  a  complex  sequence  of  decisions  as  a  single  decision,  thus 
decreasing the effective horizon of the embedded games.

Chapter 4

Generalizing Extensive-form Games
with Convex Action Sets


This chapter develops the class of convex extensive form games (CEFGs). These games are
a powerful generalization of extensive-form games that can still be solved in polynomial
time in the size of the game representation, under reasonable assumptions. Like an EFG,
a CEFG is a game with partial information played on a game tree; however, in CEFGs:

1. An arbitrary subset of players simultaneously select actions at each node, much like
in a normal form game.

2. The set of actions available to each player at a given information set is a convex
subset of $\mathbb{R}^n$, rather than a discrete set.

3. Payoffs are made at internal nodes as well as at leaves, and are given by a multilinear
function of the players' actions.

4. A linear function associates a product distribution over successor nodes with each
possible joint action.

5. Two nodes that are both in the same information set may have different numbers of
successor nodes.

These generalizations allow us to embed arbitrary convex games at the nodes of a
CEFG. This effectively unifies the problems of planning in an MDP and solving for the
minimax solution of an EFG: an MDP is a single-player, single-node CEFG, and an EFG
can be transformed to a CEFG on the same game tree. The problem of solving an MDP
where one player selects a policy and another player chooses the cost function was addressed
in Section 3.4. This problem can be modeled as a two-player, single-node CEFG.
More general versions of this problem, where the players have some limited opportunities
to observe their opponent's past actions, can also be solved as CEFGs.
This  unification  has  practical  applications  to  problems  typically  modeled  as  EFGs  as 
well.  In  particular,  our  results  make  it  possible  to  efficiently  model  games  with  outcome 
uncertainty. Modeling outcome uncertainty (where a single action can result in a
distribution over outcomes rather than a single deterministic outcome) in standard EFGs
causes an exponential blowup in the representation size, but with CEFGs we avoid this blowup.
This has ramifications for both computing sequentially rational equilibria and opponent
modeling. We discuss these and other applications of CEFGs in Section 4.4.

Efficient computation of equilibria in zero-sum EFGs depends on the property of perfect
recall; the problem is NP-hard without this assumption [Koller and Megiddo, 1992].
“Perfect  recall”  CEFGs  would  not  be  tractable  due  to  the  exponential  or  infinite  number 
of  possible  actions  at  each  node.  We  develop  a  generalization  of  perfect  recall  for  CEFGs, 
sufficient  recall,  that  allows  some  “forgetting”  of  past  actions.  A  principal  contribution  of 
this  work  is  showing  that  sufficient-recall  zero-sum  CEFGs  can  be  transformed  to  convex 
games  and  hence  solved  efficiently. 

We  define  sufficient  recall  as  the  combination  of  observation  memory  and  sufficient 
action memory (we define all these terms in Section 4.2), and then show that this
formulation is equivalent to a notion of sequence recall that is more akin to the definition of
perfect  recall  in  EFGs.  Thus,  this  result  serves  as  a  new  characterization  of  perfect  recall 
for  EFGs.  Other  characterizations  of  perfect  recall  have  recently  appeared  in  the  literature 
(see  Bonanno  [2004]  and  the  references  therein).  Some  of  these  characterizations  may  be 
related  to  our  characterization  when  it  is  applied  to  EFGs  represented  as  CEFGs,  but  as  of 
yet,  we  have  not  investigated  this  relationship. 

We  are  not  aware  of  any  similar  generalizations  of  extensive-form  games  currently 
in  the  literature.  Selten  [1999]  does  mention  a  multistage  game  model  where  multiple 
agents  select  actions  at  each  stage.  He  only  considers  the  perfect  information  case,  but 
mentions  that  “the  framework  could  be  made  as  general  as  that  of  an  extensive  game  by 
the  additional  introduction  of  information  partitions.”  However,  he  provides  no  additional 
discussion  of  the  methods  for  doing  this,  or  of  their  ramifications;  he  also  does  not  consider 
convex  action  sets. 
Components of a CEFG

$N = \{0, 1, 2, \ldots, n\}$      the $n$ players of the game; 0 is a chance player
$T = (V, E)$                      the extensive-form game tree on nodes $V$ and edges $E$
$V_p \subseteq V$                 player $p$'s decision nodes
$U_p$                             set of player $p$'s information sets
$X_u \subseteq \mathbb{R}^n$      convex sets defining the action space at $u$ ($u$ defines player)
$f_p^{s,s'}$                      linear transition function, $f_p^{s,s'} : X_u \to [0, 1]$
$M_p^s$                           payoff function at node $s$ for player $p$

Additional Notation

$p$                               an arbitrary player, $p \in N$
$s, t$                            nodes in $V$
$\mathcal{A}(s) \subseteq N$      set of players active at $s \in V$
$u \in U_p$                       $u \subseteq V_p$, an information set for player $p$
$\phi_p(s) \in U_p$               $p$'s information set containing $s$ if $s \in V_p$; $\emptyset_p$ otherwise
$\bar{X}_s$                       set of joint actions possible at node $s$
$x_u \in X_u$                     an action taken at $u$

X^                                cone extension of set $X_u$

$w(s \mid \pi_p)$                 $p$'s sequence weight on state $s$ under policy $\pi_p$
$w(u \mid \pi_p)$                 $p$'s sequence weight on info set $u$ under policy $\pi_p$

Table  4.1:  Summary  of  notation  for  convex  extensive-form  games. 


4.1  CEFGs:  Defining  the  Model 

In  this  section,  we  introduce  Convex  Extensive  Form  Games  (CEFGs)  with  n  players  and 
general  payoffs,  and  then  provide  brief  commentary  on  the  interpretation  of  the  model  and 
its  connection  to  EFGs. 

As  one  would  expect,  CEFGs  generalize  EFGs.  The  principal  differences  between  the 
two  representations  were  outlined  in  the  introduction.  In  this  section,  we  formally  define 
the model and associated notation. While we mention some differences from EFGs and
sketch a transformation from an EFG to an equivalent CEFG, our definition of the model
has no direct dependence on EFGs. Table 4.1 summarizes the components that define
a  CEFG,  as  well  as  some  associated  notation  introduced  here  and  in  subsequent  sections. 
After  formally  introducing  the  model,  we  present  several  examples  and  interpretations  in 
order  to  make  the  definition  more  concrete,  show  the  connection  to  standard  EFGs,  and 
demonstrate  the  expressive  power  of  CEFGs. 
The game tree and information sets  A CEFG is played on a directed, finite game tree
$T = (V, E)$ rooted at $s^*$. For any $s \in V$, there is a unique $s^* \leadsto s$ path, with nodes
denoted by $\mathrm{path}(s) = (s^1, s^2, \ldots, s^k)$ where $s^1 = s^*$ and $s^k = s$, and edges
$\mathcal{E}(s) = ((s^1, s^2), (s^2, s^3), \ldots, (s^{k-1}, s^k))$. The game is played by a set
$N = \{0, 1, 2, \ldots, n\}$ of players, where the 0 player is an optional chance (or "nature")
player. We generally state results for an arbitrary player $p$; when it is clear to which
player we are referring, we omit the subscript $p$ to simplify notation.

Each player is active (selects an action) on an arbitrary subset of the internal (non-leaf)
nodes,¹ $V_p \subseteq V$; these are player $p$'s decision nodes. Let $\mathcal{A}(s) = \{p \mid s \in V_p\}$, the set
of active players at $s$. We require $|\mathcal{A}(s)| \ge 1$ for all internal nodes $s$ and
$|\mathcal{A}(s)| = 0$ for leaves.

As in EFGs, the decision nodes $V_p$ for each player are partitioned into information sets
$U_p$. Formally, $\bigcup_{u \in U_p} u = V_p$, and for all $u, u' \in U_p$ we have $u \cap u' = \emptyset$
whenever $u \neq u'$.

When play reaches a node $s$ with $s \in u$ for an information set $u$ of player $p$, then player $p$
observes (is told) that the game has reached $u$, but the specific $s \in u$ is not revealed; that
is, all $s, s' \in u$ are indistinguishable to $p$.

For nodes $s$ where $p$ is not active ($s \notin V_p$), we define (for notational convenience) a
special "non-information set" $\emptyset_p$. In particular, $\emptyset_p \notin U_p$ and player $p$ never receives
$\emptyset_p$ as an observation. For any node $s \in V_p$, there exists exactly one information set
$u \in U_p$ such that $s \in u$; let $\phi_p(\cdot)$ be the function that identifies this $u$, that is, for
any $s \in V_p$, $\phi_p(s) \in U_p$ and $s \in \phi_p(s)$; when $s \notin V_p$, let $\phi_p(s) = \emptyset_p$, so that the
domain of $\phi_p$ is all of $V$. To simplify notation, when $u$ is not otherwise specified it can
be read as $\phi_p(s)$.

We explicitly exclude the property of absent-mindedness (see Piccione and Rubinstein
[1997]) by requiring that if distinct nodes $s, s' \in V_p$ are on some path to a leaf, then
$\phi_p(s) \neq \phi_p(s')$ (note that if $s, s' \notin V_p$ then $\phi_p(s) = \phi_p(s') = \emptyset_p$, but this doesn't
matter). For any state $s$ and player $p$, let $\mathrm{obs}_p(s)$ be the sequence of information sets
that occurs on the path to $s$: $\mathrm{obs}_p(s)$ has an entry $\phi_p(s^i)$ for each node $s^i$ on
$\mathrm{path}(s)$ with $\phi_p(s^i) \neq \emptyset_p$.
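The definitions of path(s), obs_p(s), and the no-absent-mindedness requirement can be made concrete with a small sketch. The tree, node names, and information-set labels below are hypothetical and chosen only for illustration; the non-information set is represented simply by a node being absent from the `phi_p` dictionary.

```python
# Hypothetical toy tree given by parent pointers (root has parent None).
parent = {"s1": None, "s2": "s1", "s3": "s1", "s4": "s2", "s5": "s3"}

# phi_p maps player p's decision nodes to information-set labels.
# Nodes missing from the dict play the role of the non-information set.
phi_p = {"s2": "u1", "s3": "u1", "s4": "u2"}

def path(s):
    """Nodes on the unique root-to-s path, root first."""
    nodes = []
    while s is not None:
        nodes.append(s)
        s = parent[s]
    return nodes[::-1]

def obs(s):
    """Sequence of p's information sets encountered on the path to s."""
    return [phi_p[t] for t in path(s) if t in phi_p]

def absent_minded():
    """True if some root-to-node path visits one of p's information sets twice."""
    for s in parent:
        seen = obs(s)
        if len(seen) != len(set(seen)):
            return True
    return False
```

Here `s2` and `s3` share the information set `u1`, which is allowed because they never lie on a common path.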

A few more notes on notation: to indicate that a particular entity belongs to a particular
player, we subscript with either $p$ or $u$, for example $x_u$ for an action or $\pi_p$ for a policy.²

¹One can imagine a matrix game is played at each node by some subset of the players, though as we
will see our model is much more general than this. The fact that we may have strict subsets of players
active at each node will require some notational gymnastics; however, this is necessary to maintain a direct
transformation from EFGs to CEFGs; we will return to this point once we have fully described our model.

²This is not technically precise in the case of $x_u$, as formally $u$ is simply a subset of $V$, and hence two
different players, say $p$ and $p'$, could "share" an information set, that is, $u \in U_p$ and $u \in U_{p'}$. However,
when we write $u$ it will be clear from context that $u$ is associated with a particular player, almost always
player $p$.
We indicate an entity is a tuple over players with a bar: for example, $\bar{x}$ is a joint action.
Entries in a tuple over players are indexed with a subscript $p \in N$, and shown without the
bar. That is, $\bar{x} = (x_0, x_1, x_2, \ldots, x_n)$. We use a bar over capital symbols to denote sets of
such tuples, for example, $\bar{X}_s$ is a set of possible joint actions.


Actions and costs  Consider a play of the game that reaches node $s$, and suppose
$u = \phi_p(s)$ for player $p \in \mathcal{A}(s)$. At $s$ player $p$ only observes $u$ (that is, $p$ cannot differentiate
between the nodes in $u$), so we require that all nodes in $u$ share the same set of actions
available to $p$. However, it is possible for two nodes $s, s' \in u$ to have different numbers of
successors in the game tree, unlike in EFGs.

In an EFG, $X_u$ would typically be a small finite set; a principal difference in CEFGs is
that $X_u$ is a convex subset of $\mathbb{R}^n$. At $s$, each player $p \in \mathcal{A}(s)$ selects an action
$x_p \in X_{\phi_p(s)}$. Note that when a player $p$ selects an action at $u$, she may well not even know
how many players are simultaneously selecting an action, as this is a function of the
(unobserved) state $s \in u$. For notational simplicity, we define the joint action
$\bar{x} = (x_0, x_1, x_2, \ldots, x_n)$ as a tuple over all the players, where we have $x_p \in X_u$ for
$p \in \mathcal{A}(s)$, and arbitrarily fix $x_p = 1$ for $p \notin \mathcal{A}(s)$; in fact, we simply define
$X_{\emptyset_p} = \{1\}$, and so the set of all possible joint actions at $s$ is
$\bar{X}_s = \bigotimes_{p \in N} X_{\phi_p(s)}$ (where $\otimes$ denotes the Cartesian set product).

This allows us to view joint actions as a tuple of actions over all the players, even though
some of the players do not actually make a decision and in fact have no knowledge (other
than what is conveyed later via their information sets) of the fact that $s$ was reached.

Costs are incurred at internal nodes, not just at leaves as for EFGs. The payoff to each
player $p$ (for all $p \in N$, not just those $p \in \mathcal{A}(s)$) is given by a function
$M_p^s : \bar{X}_s \to \mathbb{R}$. We require that $M_p^s$ be a multi-linear function of $x_0, x_1, \ldots, x_n$, that is,
it is a linear function of $x_p$ when the actions of all other players are held constant.³ Note
that this definition of the action sets and payoffs implies that a convex game is being
played at each internal node. However, the players involved may have uncertainty about the
game: they know their own $X_u$, but may not know the payoff matrix (multilinear payoff
function for $n > 2$), which other players are playing, and what actions those players have
available.
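As a small illustration of the multi-linearity requirement, the sketch below uses a two-player node whose payoff is the bilinear form $x^T A y$. The matrix `A`, the action vectors, and the function name `bilinear_payoff` are hypothetical; the point is only that, with $y$ held fixed, the payoff of a mixture of two actions of $x$ equals the same mixture of their payoffs.

```python
def bilinear_payoff(A, x, y):
    """Payoff x^T A y; linear in x for fixed y, and linear in y for fixed x."""
    return sum(x[i] * A[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))

A = [[1.0, -1.0], [0.0, 2.0]]      # hypothetical payoff matrix
y = [0.25, 0.75]                   # other player's action, held constant
x1, x2 = [1.0, 0.0], [0.0, 1.0]    # two actions of the first player
lam = 0.3

# Payoff of the mixed action versus the mixture of the payoffs.
mix = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
lhs = bilinear_payoff(A, mix, y)
rhs = lam * bilinear_payoff(A, x1, y) + (1 - lam) * bilinear_payoff(A, x2, y)
```

Both `lhs` and `rhs` agree up to floating-point rounding, which is exactly the linearity in $x_p$ that the definition asks for.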

Leaf nodes have $\mathcal{A}(s) = \emptyset$, and so the payoff to each player is a constant, written as
$M_p^s(\mathbf{1})$ where $\mathbf{1}$ is the vector of $(n + 1)$ 1s.


Successors and transitions  While we may have an infinite set of possible actions $X_u$,
we wish to avoid infinite branching in the game tree $T$. Thus, we assume each internal
node $s \in V$ has a finite set of successors, denoted $\mathrm{succ}(s)$. Given a joint action $\bar{x}$ at $s$, we
need to specify how the successor state $s' \in \mathrm{succ}(s)$ is chosen. We do this via a product
distribution over $\mathrm{succ}(s)$ that is a linear function of each player's individual action. In
particular, for each $p \in \mathcal{A}(s)$ and $s' \in \mathrm{succ}(s)$, we define a linear function
$f_p^{s,s'} : X_u \to \mathbb{R}$.

³The assumption is necessary for efficient computation on $G$, and is also necessary to show the
equivalence of explicit behavior and implicit behavior policies (introduced later).

The probability that $s'$ is the next state after $s$ is given by:

$$\Pr(s' \mid s, \bar{x}) = \prod_{p \in \mathcal{A}(s)} f_p^{s,s'}(x_p). \tag{4.1}$$

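Thus, we require that these functions satisfy the following constraints for all $s \in V$ and
$\bar{x} \in \bar{X}_s$:

$$\prod_{p \in \mathcal{A}(s)} f_p^{s,s'}(x_p) \ge 0 \quad \text{and} \quad \sum_{s' \in \mathrm{succ}(s)} \prod_{p \in \mathcal{A}(s)} f_p^{s,s'}(x_p) = 1. \tag{4.2}$$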

Again, to avoid special cases we define $f_p^{s,s'}$ for all $p \notin \mathcal{A}(s)$ as $f_p^{s,s'}(x_p) = 1$, that is,
the identity function, since $X_{\emptyset_p} = \{1\}$. Thus, we can replace the products over
$p \in \mathcal{A}(s)$ in Equation (4.2) with products over $p \in N$. Note that the $f_p^{s,s'}$ functions can, for
example, be constant functions specifying a fixed probability distribution, and so functions
satisfying the constraints (4.2) always exist. For leaf nodes $s$, we assume $|\mathcal{A}(s)| = 0$, and
no transition functions are defined.
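Equations (4.1) and (4.2) can be checked numerically with a short sketch. The data layout (a dictionary from successor to per-player functions), the helper names, and the example node below are assumptions made for illustration; the example has a chance player 0 with constant functions and one strategic player 1 acting on a two-dimensional simplex.

```python
from math import prod

def linear(coeffs):
    """Return the linear function x -> sum_i coeffs[i] * x[i]."""
    return lambda x: sum(c * xi for c, xi in zip(coeffs, x))

def transition_probs(f, joint_action):
    """f[s_next][p] is player p's f-function for successor s_next.
    Returns Pr(s_next | s, joint_action) as the product in Equation (4.1)."""
    return {s_next: prod(fp(joint_action[p]) for p, fp in per_player.items())
            for s_next, per_player in f.items()}

def satisfies_4_2(f, joint_action, tol=1e-9):
    """Check nonnegativity and normalization, the constraints in Equation (4.2)."""
    probs = transition_probs(f, joint_action)
    nonneg = all(v >= -tol for v in probs.values())
    return nonneg and abs(sum(probs.values()) - 1.0) <= tol

# Hypothetical node with successors a, b, c.
f = {
    "a": {0: linear([1.0]), 1: linear([1.0, 0.0])},
    "b": {0: linear([0.5]), 1: linear([0.0, 1.0])},
    "c": {0: linear([0.5]), 1: linear([0.0, 1.0])},
}
xbar = {0: [1.0], 1: [0.4, 0.6]}   # chance plays the fixed action 1
probs = transition_probs(f, xbar)
```

With this joint action the successors `a`, `b`, and `c` receive probabilities of roughly 0.4, 0.3, and 0.3, which sum to one as required.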

The assumptions in Equation (4.2) are sufficient for the CEFG to be well defined, that
is, they specify a valid probability distribution over successors that is a product
distribution. However, the model will be difficult to interpret if some $f_p^{s,s'}(x) > 1$. Thus, in
this paper we make the following assumption:

Assumption 4.1.1.  For all $(s, s') \in E$, all $p \in \mathcal{A}(s)$, and all $x_p \in X_u$, $f_p^{s,s'}(x_p) \in [0, 1]$.

This assumption allows us to view $f_p^{s,s'}(x)$ as the probability of some event depending
only on player $p$'s action $x$ (we explore this idea in more detail in the next section). This
assumption is in fact made without loss of generality; see Appendix A for the proof.

Let $G$ and $G'$ be two CEFGs. We say that $G$ and $G'$ are $f$-equivalent if they are
identical except for their $f$-functions, and for all $(s, s')$, for all $\bar{x} \in \bar{X}_s$,
$\Pr_G(s' \mid s, \bar{x}) = \Pr_{G'}(s' \mid s, \bar{x})$.

We model the random player as having a separate information set for each of her
decision nodes, and fix $X_u = \{1\}$ for all $u \in U_0$. Thus, the random player makes no
decisions, and is defined by her (effectively) constant $f$-functions, $f_0^{s,s'}$.

Gameplay and payoffs  A CEFG is played in a similar fashion to an EFG. We can
imagine a referee who starts the game at $s^1 = s^*$. All players $p \in \mathcal{A}(s^1)$ simultaneously
and independently select an action from $X_{\phi_p(s^1)}$, and the referee assembles these choices
into a joint action vector $\bar{x}^1$, using 1 as the action for all players not active at $s^1$. The
referee then computes the payoff $M_p^{s^1}(\bar{x}^1)$ to each player $p$ and selects the successor state
$s^2$ based on $\bar{x}^1$ according to Equation (4.1). Each player $p \in \mathcal{A}(s^2)$ receives $\phi_p(s^2)$ as an
observation and is asked for an action, and the game continues in this fashion until a leaf
is reached.

The partial history $h$ of the gameplay so far can be written as

$$h = ((s^1, \bar{x}^1), (s^2, \bar{x}^2), \ldots, (s^k, \bar{x}^k)),$$

where $s^1$ is always the root node of the game. A complete history or play is a partial
history where $s^k$ is a leaf node. Such a history can be interpreted as saying for each
tuple $i$: node $s^i$ was reached, each player $p \in \mathcal{A}(s^i)$ observed their information set $\phi_p(s^i)$,
and then played $x_p^i \in X_{\phi_p(s^i)}$. The game transitioned to some successor $s^{i+1}$ of $s^i$ with
$\Pr(s^{i+1} \mid s^i, \bar{x}^i) > 0$. The value of a history $h$ to player $p$ (that is, the total payoff to $p$) is
given by

$$V_p(h) = \sum_{(s, \bar{x}) \in h} M_p^s(\bar{x}).$$

To avoid notational hassles, if the last state in the partial history $h$ is a leaf, we assume
it is associated with a vacuous joint action so it is included in this sum and so the final
(constant) payoff is counted. This value can be thought of as the sum of the payoffs of the
individual convex games played along the path to the leaf. The goal of each player in the
game is to maximize their own total payoff, $V_p(h)$.
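The referee's loop and the history value $V_p(h)$ can be sketched as follows. Everything in this block is an assumption made for illustration: the one-player toy game, the dictionary-based encoding, and the simplification that a joint action lists only the active players (rather than padding inactive players with the action 1).

```python
import random

# Hypothetical toy game: root "r" with two leaf successors "L" and "R".
succ = {"r": ["L", "R"], "L": [], "R": []}

# trans[s][t] maps a joint action (dict: player -> action vector) to Pr(t | s, joint action).
trans = {"r": {"L": lambda xbar: xbar[1][0], "R": lambda xbar: xbar[1][1]}}

# payoff[s] maps a joint action to a tuple of payoffs indexed by player (0 = chance).
payoff = {
    "r": lambda xbar: (0.0, -0.1 * xbar[1][1]),
    "L": lambda xbar: (0.0, 1.0),
    "R": lambda xbar: (0.0, 3.0),
}

# Fixed actions for the active players at each internal node.
policy = {"r": {1: [0.25, 0.75]}}

def play(rng):
    """Referee loop: returns a complete history [(node, joint action), ...]."""
    s, h = "r", []
    while True:
        xbar = policy.get(s, {})        # leaves get a vacuous joint action
        h.append((s, xbar))
        if not succ[s]:
            return h
        weights = [trans[s][t](xbar) for t in succ[s]]
        s = rng.choices(succ[s], weights=weights)[0]

def value(h, p):
    """V_p(h): sum of player p's payoffs over all (node, joint action) pairs in h."""
    return sum(payoff[s](xbar)[p] for s, xbar in h)

h = play(random.Random(0))
```

The leaf is included in the sum through its vacuous joint action, mirroring the convention stated above for counting the final constant payoff.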

It is also useful to define the partial player history, $h_p$, the portion of the history $h$
observable by player $p$:

$$h_p = ((u^1, x_p^1), (u^2, x_p^2), \ldots, (u^k, x_p^k), u^{k+1}),$$

ending in a player $p$ information set. The partial player history only contains tuples
corresponding to observations $p$ received, that is, it has no tuples where $u^i$ is $\emptyset_p$. Each tuple
$i$ can be read as: player $p$ observed $u^i \in U_p$, selected action $x^i \in X_{u^i}$, and then at some
later point was "woken up" with the observation of $u^{i+1}$. If $h$ is a history of length $k$, the
partial player history for any particular player $p$ may have length much less than $k$,
possibly even zero.⁴ Let $H$ be the set of all possible complete plays of the CEFG, and let
$H_p$ be the set of all partial player histories for player $p$.
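A minimal sketch of projecting a history onto one player's view is given below. It assumes a simplified encoding in which every entry of $h$ carries a joint action (a dictionary from active players to actions), so the trailing observation $u^{k+1}$ is not represented; the node names, information-set labels, and the function name `player_history` are hypothetical.

```python
def player_history(h, phi_p, p):
    """Project h = [(node, joint_action), ...] onto what player p observed.

    phi_p maps p's decision nodes to information sets; nodes where p is
    inactive are absent from phi_p and therefore contribute no tuple.
    """
    return [(phi_p[s], xbar[p]) for s, xbar in h if s in phi_p]

# Hypothetical three-step history in which player 1 is inactive at s2.
h = [
    ("s1", {1: [0.5, 0.5]}),
    ("s2", {2: [1.0, 0.0]}),
    ("s4", {1: [0.2, 0.8], 2: [1.0]}),
]
phi_1 = {"s1": "u1", "s4": "u2"}
h_1 = player_history(h, phi_1, 1)
```

The projected history has two tuples although $h$ has three, which illustrates why a player's partial history can be shorter than the full history.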

⁴In an EFG, length($h$) = length($h_p$), as each internal node is in exactly one decision set.
Figure 4.1: A simple poker game with a two-card deck (Ace and King), represented as a
convex extensive-form game.


Interpretations  and  Examples 

In this section we present several small examples and demonstrate the connection to
standard EFGs.


A simple poker game, revisited  We show how to represent the poker game from
Section 3.2 as a convex extensive-form game. The CEFG representation is given in Figure 4.1;
this figure is essentially the same as Figure 3.2, but we have renumbered the states
(including the leaves), and do not show payoffs for clarity (they are identical). We have
$V = \{1, \ldots, 17\}$, with $V_x = \{2, 3, 10, 11\}$ and $V_y = \{5, 6\}$. The information sets are
again $U_x = \{u_1, u_3\}$ and $U_y = \{u_2\}$, where $u_1 = \{2, 3\}$, $u_2 = \{5, 6\}$, and $u_3 = \{10, 11\}$.
The labels on edges are for reference only, and cannot in general be viewed as actions
in a convex extensive-form game. However, in Section 4.2, we will introduce the
concept of outcomes, which partition the edges out of information sets. The labels shown in
the figure can be interpreted as outcome labels. The action sets are $X_{u_1} = \Delta(\{f, b\})$,
with $X_{u_2}$ and $X_{u_3}$ likewise the probability simplexes over the outcomes available at $u_2$
and $u_3$. The transition functions associated with state 2 are


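$$f_x^{2,4}(x) = x_1, \qquad f_x^{2,5}(x) = x_2, \qquad f_y^{2,4}(y) = 1, \qquad f_y^{2,5}(y) = 1,$$

and it is easy to verify that in fact $f_x^{2,4}(x) f_y^{2,4}(y) \ge 0$, $f_x^{2,5}(x) f_y^{2,5}(y) \ge 0$, and
$f_x^{2,4}(x) f_y^{2,4}(y) + f_x^{2,5}(x) f_y^{2,5}(y) = x_1 + x_2 = 1$
for all $x \in \Delta(\{f, b\})$ and $y \in X_{\emptyset_y} = \{1\}$, as required by Equation (4.2).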

For a general convex extensive-form game the transition functions from states 2 and
3 could be different, and in fact each state could have a different number of outcomes;
but in this case we are simply transforming an EFG, and so the functions will be identical.
More precisely, we have $f_x^{2,4} = f_x^{3,7}$ (corresponding to the fold outcome), and
$f_x^{2,5} = f_x^{3,6}$ (corresponding to the bet outcome). Equalities of this type will play a central
role in our definition of sufficient recall.

We have omitted the constant $f = 1$ functions of the random player at node 2; at
node 1, only the random player is active, with $f_0^{1,2} = 0.5$ and $f_0^{1,3} = 0.5$. The payoff
functions $M$ at internal nodes are the constant zero function, and the payoffs at the leaves
are constant functions corresponding to the payoffs shown in Figure 3.2.


A useful interpretation of the $f$-functions  Imagine that at a node $s$, player $p$ selects an
action $x_p \in X_u$ and that the action $x_p$ determines a probability distribution over
some disjoint events⁵ $A_p = \{a_1, a_2, \ldots, a_{k_p}\}$. This distribution is independent of the other
player's action if the other player happens to be active at $s$. Suppose that this probability
distribution is, in fact, a linear function of $x$, and, WLOG, $x_1, \ldots, x_{k_p}$ are the probabilities⁶
of $a_1, \ldots, a_{k_p}$, that is, $x_i = \Pr(a_i \mid x)$. The $f$-functions are linear, so suppose

$$f_p^{s,s'}(x) = \sum_{i=1}^{k_p} c_i x_i.$$

⁵Any resemblance to the notation used for actions in an EFG is purely unintentional.

⁶This is without loss of generality because if it is not the case we can embed the linear functions that
define $\Pr(a_i \mid x)$ into a new dimension of $X_u$ via an equality constraint.
If we choose coefficients $c_i \in \{0, 1\}$ then each function $f_p^{s,s'}$ simply adds up the
probabilities on some union of events in $A_p$. Letting $A_p(s') = \{a_i \in A_p \mid c_i = 1\}$, we have

$$f_p^{s,s'}(x) = \sum_{i=1}^{k_p} c_i x_i = \sum_{i=1}^{k_p} c_i \Pr(a_i \mid x) = \Pr(A_p(s') \mid x).$$

Thus, holding the other players' actions constant, the probability with which the game
transitions to $s'$ is proportional to the probability of the event $A_p(s')$.

We can independently interpret each player's $f$-functions in this way. The total set of
$f$-functions (one for each player/successor pair) determining the transition probabilities
at $s$ must be chosen to ensure that each joint event (some $a \in A_p$ for each player $p$)
is associated with exactly one successor state $s'$. Since the distribution on each $A_p$ is
independent, the resulting distribution is a product distribution.

As a concrete example, consider two players, $x$ and $y$, at node $s$, where player $x$'s
action determines a distribution on events $A = \{a_1, a_2\}$, and player $y$'s action determines
a distribution on $B = \{b_1, b_2\}$. Suppose $x$ has action set

$$X = \{(x_1, x_2) \mid x_i \ge 0,\ x_1 + x_2 = 1\} = \Delta(A),$$

that is, she can choose any distribution she wants on her events, and similarly $Y = \Delta(B)$.
Suppose from $s$ there are 3 possible successors: $s^1$, $s^2$, and $s^3$. Then, we might have
$\Pr(s^1) = \Pr(a_1)$ and $\Pr(s^2) = \Pr(a_2 \wedge b_1)$ and $\Pr(s^3) = \Pr(a_2 \wedge b_2)$. The corresponding
$f$-functions would be:

$$\begin{aligned}
f_x^{s,s^1}(x) &= x_1, & f_x^{s,s^2}(x) &= x_2, & f_x^{s,s^3}(x) &= x_2, \\
f_y^{s,s^1}(y) &= y_1 + y_2, & f_y^{s,s^2}(y) &= y_1, & f_y^{s,s^3}(y) &= y_2,
\end{aligned}$$

and so

$$\begin{aligned}
\Pr(s^1) &= x_1 (y_1 + y_2) = \Pr(a_1) \Pr(b_1 \vee b_2) = \Pr(a_1), \\
\Pr(s^2) &= x_2 y_1 = \Pr(a_2) \Pr(b_1) = \Pr(a_2 \wedge b_1), \\
\Pr(s^3) &= x_2 y_2 = \Pr(a_2) \Pr(b_2) = \Pr(a_2 \wedge b_2).
\end{aligned}$$

The relationship between the successors and the joint probability space $A \otimes B$
is shown pictorially in the following table:

           $b_1$     $b_2$
$a_1$      $s^1$     $s^1$
$a_2$      $s^2$     $s^3$
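The two-player example can be checked numerically. The sketch below hard-codes the six $f$-functions from the example; the function name `successor_probs` and the particular action vectors are illustrative choices, not part of the model.

```python
def successor_probs(x, y):
    """Transition probabilities for the successors s1, s2, s3 of the example.

    x = (x1, x2) is player x's distribution over events a1, a2;
    y = (y1, y2) is player y's distribution over events b1, b2.
    """
    f_x = {"s1": x[0], "s2": x[1], "s3": x[1]}
    f_y = {"s1": y[0] + y[1], "s2": y[0], "s3": y[1]}
    # Each successor probability is the product of the players' f-values.
    return {t: f_x[t] * f_y[t] for t in f_x}

probs = successor_probs([0.3, 0.7], [0.4, 0.6])
total = sum(probs.values())
```

For these actions the successors receive probabilities of about 0.3, 0.28, and 0.42, and `total` equals one up to rounding, as the constraints in Equation (4.2) require.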
The $f$-functions can always be represented as a partition of the product space of the
individual event sets $A_p$. However, arbitrary partitions are not possible: each successor state
must be associated with a "rectangle" of the product event space, that is, each successor $s'$
corresponds to some Cartesian product of subsets of each $A_p$, namely $\bigotimes_{p} A_p(s')$.

In all of our applications, the $f$-functions can be interpreted in this way. For example,
consider an MDP embedded at node $s$ of a CEFG, so $X_u$ for player $p$ is the set of
state-action visitation frequencies corresponding to valid stochastic policies (this set is defined
in Section 3.4.3). Further, suppose only $p$ is active at $s$. If the MDP has $k$ terminal states
(absorbing goal states), then we can define events $a_1, \ldots, a_k$ where $a_i$ is the event that the
policy selected takes the agent to terminal state $i$. These probabilities are a linear function
of $x \in X_u$, and so we might have $k$ successors $s'$, where each function $f_p^{s,s'}$ computes the
probability that the agent reaches a particular terminal state under the chosen policy.


Interpretations of the action sets  As with convex games, we have two possible
interpretations of the set $X_u$:

1. We interpret the set $X_u$ as a continuous set of primitive actions.

2. We treat only the extreme points $\mathrm{Cn}(X_u)$ as primitive actions, with interior points
defining an equivalence class of mixed strategies.⁷

In Section 3.1 we showed that all distributions over $\mathrm{Cn}(X)$ (and hence, $X_u$ for
CEFGs) correspond to (immediate) payoff equivalent actions (strategies for the convex game).
However, in CEFGs we must also worry about the rest of the game: two actions that lead
to the same immediate payoff but produce potentially different distributions over successor
nodes in the game tree are clearly not equivalent. However, because the transition
functions $f$ are linear, it is easy to modify the arguments for immediate payoff equivalence
to show that any two probability distributions over corners that produce the same interior
point $x$ also produce the same distribution over successor nodes for any (fixed) joint
policy for the other players. Hence, even if we consider the set $\mathrm{Cn}(X_u)$ to be the actual set
of primitive actions, we can optimize over the set $X_u$ and sample from a small support
representation of an interior point if needed.
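The linearity argument can be illustrated with a small sketch. It assumes a hypothetical action set, the unit square, whose corners admit two different distributions with the same mean, and a hypothetical linear transition function `f`; both distributions then induce the same probability for the successor.

```python
def f(x):
    """A hypothetical linear transition function on the unit square."""
    return 0.3 * x[0] + 0.5 * x[1]

def mean_point(dist):
    """Interior point produced by a distribution {corner: probability}."""
    return tuple(sum(p * c[i] for c, p in dist.items()) for i in range(2))

def expected_f(dist):
    """Probability of the successor when a corner is sampled from dist."""
    return sum(p * f(c) for c, p in dist.items())

# Two different distributions over corners of the unit square.
d1 = {(0, 0): 0.5, (1, 1): 0.5}
d2 = {(1, 0): 0.5, (0, 1): 0.5}
```

Both `d1` and `d2` average to the interior point (0.5, 0.5), and `expected_f` returns the same value for each, namely the value of `f` at that interior point.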


Connection to EFGs  In EFGs, no generality is gained by assigning costs at internal
nodes $s$, as we can always simply add any immediate costs at $s$ to the cost at every leaf
reachable from $s$. However, in CEFGs the costs we incur may depend on the exact action
$x_u$, not just the successor state, and so deferring costs to the leaf would require
"remembering" the action $x_u$ in the tree, giving rise to an infinite branching factor. By assigning
some cost based on $\bar{x}_s$ immediately at the internal node $s$, our model can handle costs that
depend on continuous, multi-dimensional actions. The key is that the full action impacts
only the immediate cost incurred, and that once that cost is accounted for, only a finite
number of successor states are possible.

⁷As with convex games, we could use any subset $X' \subseteq X$ instead of $\mathrm{Cn}(X_u)$, as long as we have an
efficient algorithm to express any point $x \in X$ as a distribution over points in $X'$.

The fact that we have continuous, multi-dimensional actions has significant
ramifications for tractable algorithms. The efficient solution of two-player, zero-sum EFGs
relies on the assumption of perfect recall: informally, each information set $u$ for player $p$
uniquely determines the sequence of actions $p$ has taken up to that point. Clearly, we will
not be able to have this property given continuous actions and a finite tree; some
forgetting must be allowed. Thus, in the next section we introduce the concept of sufficient
recall, which provides enough "memory" in the tree to allow optimal play given only the
current information set without fully encoding all past actions.

Finally, we note that CEFGs generalize EFGs. In particular, for any EFG $G$ there is a
mapping to an equivalent CEFG $G'$, such that an equilibrium solution in $G'$ can be mapped
back to an equilibrium solution of $G$. The mapping from an EFG $G$ to a CEFG $G'$ is the
natural one. Each node in $G$ corresponds to a node in $G'$; information sets are also mapped
directly, so that in $G'$ we will have $|\mathcal{A}(s)| = 1$ for all internal nodes, and $|\mathcal{A}(s)| = 0$ at the
leaves. Each internal node has a payoff function $M_p^s(\bar{x}) = 0$, and each leaf has a constant
payoff $M_p^s(\mathbf{1}) \in \mathbb{R}$ equal to the payoff at the corresponding leaf in $G$.

For each internal node $s$, if the active player $p$'s choices in $G$ for $u = \phi_p(s)$ were
$C_u = \{c_1, \ldots, c_k\}$, we define $X_u \subseteq \mathbb{R}^k$ to be the $k$-dimensional probability simplex
$\Delta(C_u)$. The corners $\mathrm{Cn}(X_u)$ correspond to the choices/outcomes in $G$, while an interior
point $x$ corresponds to a behavior (a distribution over choices). The successors of $s$ are
(say) $s^1, \ldots, s^k$, and the $f$-functions are simply $f_p^{s,s^i}(x) = x_i$.

It  is  easy  to  show  that  this  mapping  produces  a  valid  CEFG,  and  that  policies  can  be 
transferred  back  and  forth  between  G  and  G' ,  with  payoffs  being  identical  in  G  or  G'. 
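The per-node part of this mapping can be sketched in a few lines. The function name `efg_node_to_cefg`, the choice labels, and the successor identifiers are hypothetical; the sketch only shows that the action set becomes the simplex over the node's choices and that the $f$-function for the $i$-th successor returns the $i$-th coordinate of the action.

```python
def efg_node_to_cefg(choices, successors):
    """Build CEFG data for one EFG decision node.

    Returns (f, in_simplex): f[t] is the transition function for successor t,
    and in_simplex tests membership in the simplex over `choices`.
    """
    assert len(choices) == len(successors)
    # Bind i per successor so each function reads its own coordinate.
    f = {t: (lambda x, i=i: x[i]) for i, t in enumerate(successors)}

    def in_simplex(x, tol=1e-9):
        return (len(x) == len(choices)
                and all(v >= -tol for v in x)
                and abs(sum(x) - 1.0) <= tol)

    return f, in_simplex

# Hypothetical node with choices fold/bet leading to successors 4 and 5.
f, in_simplex = efg_node_to_cefg(["fold", "bet"], [4, 5])
behavior = [0.25, 0.75]   # an interior point: a distribution over choices
```

A corner such as `[1.0, 0.0]` reproduces the deterministic EFG choice, while `behavior` reaches successor 4 with probability 0.25 and successor 5 with probability 0.75.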


Compact representations with CEFGs  We provide a quick illustration of the power of
the CEFG model by showing there are games with a CEFG representation that is
exponentially smaller than the EFG representation. Consider the following zero-sum game played
by $x$ and $y$ in $k$ rounds. On each round, each player chooses a number from $\{1, \ldots, 10\}$.
At the end of $k$ rounds, player $x$ pays $y$ an amount that depends arbitrarily on the sequence
of $k$ numbers $x$ picked, plus \$1 for each time $y$ guessed $x$'s action. Neither player observes
the other player's past actions.

This game can be represented as an EFG, played on a height $2k$ tree with branching
factor 10: there is one leaf for each possible pair of length $k$ sequences, corresponding
to the numbers picked by $x$ and $y$. Thus, this tree has $10^{2k}$ leaves. This game has an
exponentially smaller representation as a CEFG: the game tree is now of height $k$, and
each leaf corresponds to the sequence of $x$'s choices; the choices of $y$ are "forgotten" by
the tree. Each internal node is a matrix game where $x$ pays \$1 if $y$ guesses his action
(that is, the payoff matrix is the $10 \times 10$ identity matrix). Thus, the possible immediate payoff
of \$1 depends on both actions, but the successor state depends only on $x$'s action (all
$f_y^{s,s'} = 1$). An additional constant payoff is given at each leaf based on the sequence of $x$'s
choices. This tree has only $10^k$ leaves, and so the representation is exponentially smaller
than the EFG (encoding the matrix games only increases the size by a constant factor).
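The size comparison is simple arithmetic, sketched below for concreteness. The function names are illustrative; the counts follow directly from the tree heights and the branching factor of 10 described above.

```python
def efg_leaves(k, choices=10):
    """Leaves of the EFG tree: one per pair of length-k choice sequences."""
    return choices ** (2 * k)

def cefg_leaves(k, choices=10):
    """Leaves of the CEFG tree: one per length-k sequence of x's choices."""
    return choices ** k

def size_ratio(k, choices=10):
    """How many times more leaves the EFG has than the CEFG."""
    return efg_leaves(k, choices) // cefg_leaves(k, choices)
```

For example, with $k = 3$ rounds the EFG has a million leaves and the CEFG a thousand, and the ratio itself grows as $10^k$.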

This  game  might  model  a  situation  where  y  is  placing  bets  on  player  x’s  location, 
while  player  x  is  trying  to  accomplish  some  task.  More  realistic  games  can  certainly  be 
constructed;  the  goal  here  is  to  illustrate  the  representative  power  of  CEFGs.  Note  that  we 
have  not  fully  exploited  the  power  of  this  model:  our  action  sets  were  still  only  probability 
simplexes,  and  all  nodes  in  an  information  set  had  the  same  number  of  successors. 


4.2  Sufficient  Recall  and  Implicit  Behavior 
Reactive  Policies 

In  this  section,  we  develop  some  important  theoretical  results  concerning  CEFGs.  Our 
principal  result  will  be  developing  a  notion  of  sufficient  recall  (analogous  to  perfect  recall 
in  EFGs)  and  showing  that  for  CEFGs  with  sufficient  recall,  a  class  of  behavior  strategies 
always  contains  an  optimal  policy.  These  results  allow  us  to  construct  a  polynomial-time 
algorithm  for  solving  sufficient-recall  CEFGs  in  the  next  section. 


Policies  and  Probability 

In  this  section,  we  formally  define  payoffs  for  a  CEFG,  define  some  policy  classes,  and 
define  payoff  equivalence. 


Policy classes  A policy (or strategy) is a complete description of how to play a game; in
the case of CEFGs, it is a means of selecting an action $x \in X_u$ at each information set $u$
that occurs. We can think of policies as functions or programs depending on our point of
view. Generally speaking, however, any reasonable policy must select its action at $u$ based
only on its past observations and actions, and possibly some source of randomness.

We formalize policies as functions and define terminology by specifying different
possibilities for the domain and range. A policy function is history dependent if it takes as
input the partial player history so far, that is, its domain is $H_p$. A policy is reactive (or
memoryless) if it depends only on the current information set, and not on its past actions or
observations. For the range: a pure policy chooses a single action from the set of primitive
actions⁸ $\mathrm{Cn}(X_u)$. An implicit behavior policy picks an interior point of $X_u$, and interprets
this point as a distribution over the primitive actions $\mathrm{Cn}(X_u)$ (that is, it samples a corner
from an arbitrary probability distribution from the equivalence class of such distributions
defined by the interior point). Finally, an explicit behavior policy specifies a particular
distribution over $\mathrm{Cn}(X_u)$. An explicit behavior policy might put positive probability on an
exponential number of corners of $X_u$, but an implicit behavior policy at $u$ can always be
represented concisely.

These  choices  give  us  6  classes  of  policy  functions,  for  each  combination  of  domain 
and  range.  When  naming  policies  we  specify  the  range  first,  then  the  domain,  and  so 
refer  to:  pure  history  policies,  implicit  behavior  history  policies,  implicit  behavior  reactive 
policies,  and  so  on.  A  mixed  policy  is  defined  by  a  probability  distribution  over  one  of 
the  above  classes.  Considering  the  mixed  versions  of  the  above  classes  gives  a  total  of  12 
policy  classes.  Fortunately,  we  will  only  need  to  focus  on  a  few  of  these  classes. 
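
The counting above can be made concrete by enumerating the combinations directly. This is a minimal sketch; the string labels are just the names used in the text, and the variable names are hypothetical.

```python
from itertools import product

# Range first, then domain, following the naming convention in the text.
RANGES = ["pure", "implicit behavior", "explicit behavior"]
DOMAINS = ["history", "reactive"]

# The 6 base classes: one per (range, domain) combination.
base_classes = [f"{r} {d}" for r, d in product(RANGES, DOMAINS)]

# A mixed policy is a distribution over one of the base classes,
# which doubles the count to 12.
mixed_classes = [f"mixed {c}" for c in base_classes]
all_classes = base_classes + mixed_classes

if __name__ == "__main__":
    for name in all_classes:
        print(name)
```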

The literature on EFGs generally considers mixed, pure, and behavior strategies. An
EFG pure policy is a pure reactive policy in our terms, an EFG mixed policy is a mixed
pure reactive policy, and an EFG behavior policy is an (implicit or explicit) behavior
reactive policy.⁹ The set of general policies is the union of the policy classes just mentioned:
it can be thought of as the set of all possible strategies a player could use. We will be
particularly concerned with the class of implicit behavior reactive policies (IBRPs), which
are policies specified by a function from the current information set $u$ to the set $X_u$. We
often denote such policies by $\beta$, and write $\beta(u) \in X_u$ for the action selected at $u$.

We use $\kappa_p$ to denote a general player $p$ policy. We write $\kappa_{-p}$ for a joint policy for all
players except $p$, that is,

$$\kappa_{-p} = (\kappa_1, \kappa_2, \ldots, \kappa_{p-1}, \kappa_{p+1}, \ldots, \kappa_n),$$

and let $(\kappa_p, \kappa_{-p})$ be the joint policy where players other than $p$ play according to $\kappa_{-p}$ and
player $p$ follows $\kappa_p$.

⁸ For simplicity we assume the set of primitive actions is $\mathrm{Cn}(X_u)$, as this is the most common case. But
for some games it might be all of $X_u$ or some other subset of $X_u$.

⁹ In an EFG, the set $X_u$ is the probability simplex over choices, and so there is a one-to-one
correspondence between interior points and distributions over corners. Hence, the classes of implicit behavior and
explicit behavior policies are identical in EFGs (even if represented as CEFGs).


Probability  We introduce the basic probability measure used for probability statements
about CEFGs, and also establish our notation for various events. For simplicity, we
assume each policy only ever plays actions from a countable subset of $X_u$. This simplifies
the notation and proofs in the next section by allowing us to always work directly with
probabilities rather than probability densities; it also makes the connection to results for
EFGs more clear. In particular, this assumption allows us to work with the probability that
a policy picks a certain action given that $u$ is reached, $\Pr(x \mid u)$. Based on this
assumption, we abuse notation slightly by writing sums like $\sum_{x \in X_u} \Pr(x \mid u)$ when we implicitly
mean summing over only those $x \in X_u$ that the policy might actually play.

With  suitable  attention  to  technical  detail,  these  results  should  go  through  for  policies 
that  select  actions  from  all  of  Xu  by  working  with  the  appropriate  probability  densities.  In 
fact,  for  the  case  of  polytopes,  the  restriction  to  policies  that  play  from  a  countable  subset 
of  Xu  is  without  loss  of  generality:  any  policy  that  sometimes  plays  interior  points  can  be 
interpreted  as  a  policy  that  plays  a  distribution  over  extreme  points. 

Any joint policy $\kappa = (\kappa_1, \ldots, \kappa_n)$ where each player fixes some general policy $\kappa_p$
induces a probability distribution on $H$. When we want to make it clear which joint policy
is associated with a given probability or expectation, we include the policy as a condition,
for example, $\Pr(s \mid \kappa)$; subscripting $\Pr$ would be more precise, but is typographically
cumbersome.

For a fixed $\kappa$, $v_p$ is a random variable, and the expected payoff $V_p$ to player $p$ under
joint policy $\kappa$ is

$$V_p(\kappa) = \mathrm{E}[v_p].$$

When a state $s$ appears where an event (a subset of $H$) is appropriate, we treat $s$ as the
subset of the complete histories in $H$ in which $s$ occurs ($s$ is reached at some point in the
play). We denote by $\neg s$ the complement of this set, the set of plays in which $s$ does not
occur. Similarly, we view an information set $u$ for player $p$ as the subset of plays in $H$
where some $s \in u$ is reached; in this context $u = \bigcup_{s \in u} s$. We write $(u, x_p)$ for the event
that player $p$ plays $x_p \in X_u$ from information set $u$. When $u$ and $p$ are clear from context,
we write simply $x$ for this event.

¹⁰ Some care must still be taken when dealing with an unbounded polyhedron $X$.

A policy $\kappa_p$ for player $p$ is payoff equivalent to another policy $\kappa'_p$ if, for all $\kappa_{-p}$ for
the other players and for all players $q \in N$, $V_q(\kappa_p, \kappa_{-p}) = V_q(\kappa'_p, \kappa_{-p})$. This definition of
equivalence is standard [see Dalkey, 1953, for example]. Finally, we state one reasonable
additional assumption:

Assumption 4.2.1. For every $s \in V$, there exists at least one joint policy $\kappa$ such that
$\Pr(s \mid \kappa) > 0$.

If a CEFG of interest does not satisfy this assumption, unreachable states can be
removed in a pre-processing step.


Sequence  Weights 


In this section, we prove some basic results that apply to all CEFGs, even those without
sufficient recall. In particular, we introduce a generalized notion of sequence weights,
which we then use to show that the probability that a given state is reached, $\Pr(s \mid \kappa)$, is
given by a product distribution.

First, we prove an important structural lemma that shows that we can calculate the
value of the game by summing the expected payoff at each state weighted by the
probability that the state is reached. This, combined with representing the probability of each
state as a product distribution, identifies the problem structure that we later exploit to
obtain an efficient algorithm.

Define $x_s$ to be a random vector giving the joint action taken at $s$ when $s \in h$, that is,
$x_s(h) = x'$ when the tuple $(s, x')$ appears in $h$. When $s \notin h$, $x_s(h)$ is undefined. Thus,
when we use $x_s$ in expectations or probabilities, we will always condition on the fact that
$s \in h$.


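Lemma 4.2.2. For any joint policy $\kappa$ and any player $p$, let

$$R = \{ s \mid s \in V,\ \Pr(s \mid \kappa) > 0 \}.$$

Then,

$$V_p(\kappa) = \sum_{s \in R} \Pr(s \mid \kappa)\, \mathrm{E}[M_p^s(x_s) \mid s, \kappa].$$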


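Proof. Define random variables $v_p^s$ for $s \in V$, $p \in N$, by

$$v_p^s(h) = \begin{cases} M_p^s(x_s(h)) & \text{if } s \in h \\ 0 & \text{otherwise.} \end{cases}$$

Then,

$$\mathrm{E}[v_p^s] = \Pr(s)\,\mathrm{E}[v_p^s \mid s] + \Pr(\neg s)\,\mathrm{E}[v_p^s \mid \neg s] = \Pr(s)\,\mathrm{E}[v_p^s \mid s] = \Pr(s)\,\mathrm{E}[M_p^s(x_s) \mid s] \tag{4.3}$$

because $\mathrm{E}[v_p^s(h) \mid \neg s] = 0$. Also, $v_p(h) = \sum_{s \in V} v_p^s(h)$, since each $s$ occurs at most once
in a given history (since $T$ is a tree). Then,

$$V_p(\kappa) = \mathrm{E}[v_p] = \mathrm{E}\Big[\sum_{s \in V} v_p^s\Big] = \sum_{s \in V} \mathrm{E}[v_p^s] = \sum_{s \in V} \Pr(s \mid \kappa)\,\mathrm{E}[M_p^s(x_s) \mid s], \tag{4.4}$$

where we have used linearity of expectation and Equation (4.3). □

¹¹ The name is by analogy to sequence weights in EFGs; see Koller and Megiddo [1992] and Koller et al.
[1994] in particular.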

Now, we define the sequence weight $w_p(s \mid \kappa_p)$ of $s \in V$ for an arbitrary
policy $\kappa_p$ for player $p$. Intuitively, the sequence weight $w_p(s \mid \kappa_p)$ is the probability we
reach $s$ given that all other players (and their randomness) “conspire” to force us to $s$. We
formalize this notion in the proof of the next lemma, which then allows us to formally
define sequence weights for a CEFG.

Lemma 4.2.3. For any player $p$ using policy $\kappa_p$ and any two joint policies for the other
players $\kappa_{-p}$ and $\kappa'_{-p}$, for any $s \in V_p$ where $\Pr(s \mid (\kappa_p, \kappa_{-p})) > 0$ and $\Pr(s \mid (\kappa_p, \kappa'_{-p})) > 0$,
we have

$$\Pr((s, x_p) \mid s, (\kappa_p, \kappa_{-p})) = \Pr((s, x_p) \mid s, (\kappa_p, \kappa'_{-p})).$$

Proof. Fix a CEFG $G$, a particular state $s \in V$, and a player $p$. We construct a
single-player game $\mathcal{G}(p, s)$ as follows: The game includes the states $\mathrm{path}(s) = s^0, \ldots, s^k$ (where
$s^0$ is the root, and $s^k = s$). Let $S' = \{s^0, \ldots, s^k\}$. The game starts at $s^0$. At each $s^i \in S'$,
if $p \notin \mathcal{A}(s^i)$, the game always continues to $s^{i+1}$. Thus, $\mathcal{G}(p, s)$ is really only defined by the
states $s^i \in V_p \cap S'$. If $p \in \mathcal{A}(s^i)$, the player chooses an action $x \in X_u$; with probability¹²
$f_p^{s^i s^{i+1}}(x)$ the game continues to $s^{i+1}$, and otherwise it stops. The game always ends if
play reaches $s$ ($= s^k$).

Observe that we can use any policy $\kappa_p$ for $G$ to play $\mathcal{G}(p, s)$: at each $s^i$ where
$p \in \mathcal{A}(s^i)$, we tell the policy it is in information set $u = \phi_p(s^i)$, and it returns an action
$x \in X_u$. In fact, there is no way for the policy $\kappa_p$ to realize it is not being used to play
$G$. Thus, each $\kappa_p$ induces a probability distribution on histories of $\mathcal{G}(p, s)$ (a sequence of
actions taken up until the end of the game). In particular, $\Pr_{\mathcal{G}(p,s)}(s^i \mid \kappa_p)$ is the probability
that the game reaches $s^i$ under $\kappa_p$.

We can also consider $\kappa_p$'s action selection in $\mathcal{G}(p, s)$. In particular, if $\Pr_{\mathcal{G}(p,s)}(s^i \mid \kappa_p) > 0$,
then $\Pr_{\mathcal{G}(p,s)}(x_p \mid s^i, \kappa_p)$ is also well defined. Whenever $\kappa_p$ selects an action in $\mathcal{G}(p, s)$, it
behaves exactly as if it were selecting an action in $G$; the lemma follows immediately. □

¹² By Assumption (4.1.1), $f_p^{s^i s^{i+1}}(x) \in [0, 1]$.

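As a consequence of this lemma, we write $\Pr(x_p \mid s, \kappa_p)$ for this probability; it is
defined for any state where $\Pr_{\mathcal{G}(p,s)}(s \mid \kappa_p) > 0$. We then define the sequence weight of $s$
given $\kappa_p$ by $w(s \mid \kappa_p) = \Pr_{\mathcal{G}(p,s)}(s \mid \kappa_p)$. Note that if the path to $s$ contains no player $p$
information sets then $w(s \mid \kappa_p) = 1$. When $w(s \mid \kappa_p) > 0$, we can then calculate it as

$$w(s \mid \kappa_p) = \Pr_{\mathcal{G}(p,s)}(s \mid \kappa_p) = \prod_{(t,t') \in \mathcal{E}(s)} \sum_{x \in X_{\phi_p(t)}} \Pr(x \mid t, \kappa_p)\, f_p^{tt'}(x), \tag{4.5}$$

where we have $\Pr(1 \mid s) = 1$ and $f_p^{ss'}(1) = 1$ when $p \notin \mathcal{A}(s)$ (equivalently, we take the
product to only be over edges $(s, s')$ where $s \in V_p$). It is also useful to define

$$\mathrm{E}[x \mid s, \kappa_p] = \sum_{x \in X_u} \Pr(x \mid s, \kappa_p)\, x$$

(when $w(s \mid \kappa_p) > 0$), and then using the linearity of $f$, we have for nonzero $w(s \mid \kappa_p)$,

$$w(s \mid \kappa_p) = \prod_{(t,t') \in \mathcal{E}(s)} f_p^{tt'}\big(\mathrm{E}[x \mid t, \kappa_p]\big). \tag{4.6}$$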
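
The agreement between Equations (4.5) and (4.6) can be checked numerically on a small path. The sketch below assumes each $f$-function is linear and given by a coefficient vector, so that $f(x) = c \cdot x$; the data layout (a list of `(coeffs, dist)` pairs, one per player $p$ edge on the path) and all names are hypothetical and chosen only for illustration.

```python
def dot(c, x):
    """Linear f-function f(x) = c . x, with c and x as equal-length tuples."""
    return sum(ci * xi for ci, xi in zip(c, x))


def weight_by_sum(path):
    """Eq. (4.5): product over edges of sum_x Pr(x | t) * f(x)."""
    w = 1.0
    for coeffs, dist in path:
        w *= sum(prob * dot(coeffs, x) for x, prob in dist)
    return w


def weight_by_mean(path):
    """Eq. (4.6): product over edges of f(E[x | t])."""
    w = 1.0
    for coeffs, dist in path:
        dim = len(coeffs)
        mean = [sum(prob * x[i] for x, prob in dist) for i in range(dim)]
        w *= dot(coeffs, mean)
    return w


# Hypothetical path with two player-p edges. Each entry is
# (coefficients of f on that edge, [(action, probability), ...]);
# actions are corners of a 2-dimensional simplex.
path = [
    ((1.0, 0.5), [((1.0, 0.0), 0.25), ((0.0, 1.0), 0.75)]),
    ((0.0, 1.0), [((1.0, 0.0), 0.5), ((0.0, 1.0), 0.5)]),
]

if __name__ == "__main__":
    print(weight_by_sum(path), weight_by_mean(path))
```

Both functions return the same value because a linear function commutes with expectation; this is the only property of $f$ that the step from (4.5) to (4.6) uses.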

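Define $\mathrm{REL}(\kappa_p) = \{ s \mid w(s \mid \kappa_p) > 0 \}$. This is exactly the set of states such that there
exists a $\kappa_{-p}$ such that $\Pr(s \mid (\kappa_p, \kappa_{-p})) > 0$ (this can be proved based on the definition of
the $\mathcal{G}(p, s)$ game and Assumption (4.2.1)). Any state $s \notin \mathrm{REL}(\kappa_p)$ is ruled out by $\kappa_p$: it
is never reached when player $p$ uses $\kappa_p$. We extend the REL notation to joint policies by
defining $\mathrm{REL}(\kappa) = \bigcap_p \mathrm{REL}(\kappa_p) = \{ s \mid \Pr(s \mid \kappa) > 0 \}$, and $\mathrm{REL}(\kappa_{-p}) = \bigcap_{p' \neq p} \mathrm{REL}(\kappa_{p'})$.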

Lemma 4.2.4. For any $s \in V$ and any joint policy $\kappa$,

$$\Pr(s \mid \kappa) = \prod_p w_p(s \mid \kappa_p).$$

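Proof. First, observe that if for any $p$ we have $w_p(s \mid \kappa_p) = 0$ then $s \notin \mathrm{REL}(\kappa_p)$ and
$\Pr(s \mid \kappa) = 0$, and so the equality holds. Now, suppose $\Pr(s \mid \kappa) > 0$. If $s$ is ever reached
under $\kappa$,

$$\Pr((s, x) \mid s, \kappa) = \prod_p \Pr(x_p \mid s, \kappa_p)$$

as a consequence of Lemma (4.2.3), and so for any reachable $s \in \mathrm{REL}(\kappa)$ with successor $s'$,

$$\begin{aligned}
\Pr(s' \mid s, \kappa) &= \sum_{x \in X_s} \Pr(x \mid s, \kappa)\, \Pr(s' \mid s, x) \\
&= \sum_{x \in X_s} \Big( \prod_p \Pr(x_p \mid s, \kappa_p) \Big) \Big( \prod_p f_p^{ss'}(x_p) \Big) \\
&= \sum_{x \in X_s} \prod_p \Pr(x_p \mid s, \kappa_p)\, f_p^{ss'}(x_p) \\
&= \prod_p \sum_{x_p \in X_u} \Pr(x_p \mid s, \kappa_p)\, f_p^{ss'}(x_p).
\end{aligned} \tag{4.7}$$

Thus,

$$\begin{aligned}
\Pr(s \mid \kappa) &= \prod_{(t,t') \in \mathcal{E}(s)} \Pr(t' \mid t, \kappa) \\
&= \prod_{(t,t') \in \mathcal{E}(s)} \prod_p \sum_{x_p \in X_{\phi_p(t)}} \Pr(x_p \mid t, \kappa_p)\, f_p^{tt'}(x_p) && \text{by Eq. (4.7)} \\
&= \prod_p \prod_{(t,t') \in \mathcal{E}(s)} \sum_{x_p \in X_{\phi_p(t)}} \Pr(x_p \mid t, \kappa_p)\, f_p^{tt'}(x_p) \\
&= \prod_p w_p(s \mid \kappa_p) && \text{by Eq. (4.5).}
\end{aligned}$$

□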
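
The key step in the proof, exchanging the sum over joint actions with the product over players in Equation (4.7), can be checked on a single edge. The sketch below uses made-up numbers; the list `players` and both function names are hypothetical. Each player is represented by a list of `(Pr(x_p | s), f_p(x_p))` pairs, one per action that player might take.

```python
from itertools import product
from math import isclose, prod

# Hypothetical single edge (s, s') with three players.
players = [
    [(0.2, 1.0), (0.8, 0.5)],
    [(0.5, 0.0), (0.5, 1.0)],
    [(1.0, 0.9)],  # e.g. a player with a single action and constant f
]


def transition_by_joint_sum(players):
    """Sum over joint actions x of prod_p Pr(x_p | s) * f_p(x_p)."""
    return sum(
        prod(pr * f for pr, f in joint) for joint in product(*players)
    )


def transition_by_product(players):
    """prod_p sum_{x_p} Pr(x_p | s) * f_p(x_p), the last line of Eq. (4.7)."""
    return prod(sum(pr * f for pr, f in acts) for acts in players)


if __name__ == "__main__":
    print(transition_by_joint_sum(players), transition_by_product(players))
```

The two computations agree because each term of the joint sum factors across players; the product form needs only one pass per player rather than one term per joint action.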

For convenience, for any joint policy $\kappa$, we define $w(s \mid \kappa) = \prod_p w_p(s \mid \kappa_p)$, and
similarly for a policy $\kappa_{-p}$ for all players other than $p$, we define $w(s \mid \kappa_{-p}) = \prod_{p' \neq p} w_{p'}(s \mid \kappa_{p'})$.
We now prove a lemma that is very useful in proving two policy classes are
equivalent:

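Lemma 4.2.5. If $\kappa_p$ and $\kappa'_p$ are two policies for player $p$ such that

$$\mathrm{E}[x_p \mid s, \kappa_p] = \mathrm{E}[x_p \mid s, \kappa'_p]$$

for all $s \in \mathrm{REL}(\kappa_p) \cap \mathrm{REL}(\kappa'_p)$, then $\mathrm{REL}(\kappa_p) = \mathrm{REL}(\kappa'_p)$, and further $\kappa_p$ and $\kappa'_p$ are
payoff equivalent.

Proof. Observe that $\mathrm{REL}(\kappa_p)$ and $\mathrm{REL}(\kappa'_p)$ are trees rooted at the root of $T$: if $s \in \mathrm{REL}(\kappa_p)$, then
all $s' \in \mathrm{path}(s)$ are also in $\mathrm{REL}(\kappa_p)$. Thus, $\mathrm{REL}(\kappa_p) \cap \mathrm{REL}(\kappa'_p)$ is also a tree, and so for
$s \in \mathrm{REL}(\kappa_p) \cap \mathrm{REL}(\kappa'_p)$ we have $w_p(s \mid \kappa'_p) = w_p(s \mid \kappa_p)$ by Equation (4.6). Suppose
$\mathrm{REL}(\kappa_p) \neq \mathrm{REL}(\kappa'_p)$. Then, WLOG, there exists $(s, s') \in E$ such that $s \in \mathrm{REL}(\kappa_p) \cap \mathrm{REL}(\kappa'_p)$,
$s' \notin \mathrm{REL}(\kappa_p)$, and $s' \in \mathrm{REL}(\kappa'_p)$. Then, $w_p(s' \mid \kappa_p) = 0$ and $w_p(s' \mid \kappa'_p) > 0$.
Equation (4.6) implies

$$w_p(s' \mid \kappa'_p) = w_p(s \mid \kappa'_p) \cdot f_p^{ss'}\big(\mathrm{E}[x_p \mid s, \kappa'_p]\big).$$

Further, $w_p(s \mid \kappa'_p) = w_p(s \mid \kappa_p)$ because $s \in \mathrm{REL}(\kappa_p) \cap \mathrm{REL}(\kappa'_p)$, and $\mathrm{E}[x_p \mid s, \kappa_p] =
\mathrm{E}[x_p \mid s, \kappa'_p]$, and so we must have $w_p(s' \mid \kappa_p) = w_p(s' \mid \kappa'_p)$, a contradiction. Thus, we
conclude $\mathrm{REL}(\kappa_p) = \mathrm{REL}(\kappa'_p)$.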

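Now, we proceed to show payoff equivalence. Fix any $\kappa_{-p}$ for the other players.
Equation (4.6) shows $\kappa_p$ and $\kappa'_p$ have equal sequence weights, and so by Lemma (4.2.4),

$$\Pr(s \mid (\kappa_p, \kappa_{-p})) = \Pr(s \mid (\kappa'_p, \kappa_{-p})) \quad \text{for all } s.$$

Using Lemma (4.2.2), it is now sufficient to show that the expected payoff for an
arbitrary player $q \in N$ at each state reached with positive probability is equal. This is
clearly true at states where $p \notin \mathcal{A}(s)$. Consider some $s \in V_p$ where $s \in u$. The key is
that the payoff function is multi-linear. Let $x_{-p}$ be a joint action for all players other
than $p$, so that for any $x_p \in X_u$, $(x_p, x_{-p}) \in X_s$ is a joint action at $s$. Then, multi-linearity
implies there exists a vector $\bar{m}_s(x_{-p})$ such that for any $x_p \in X_u$,

$$M_q^s((x_p, x_{-p})) = \bar{m}_s(x_{-p}) \cdot x_p.$$

Now, for $\kappa_p$ and any $s$ with $\Pr(s \mid (\kappa_p, \kappa_{-p})) > 0$,

$$\begin{aligned}
\mathrm{E}[M_q^s(x_s) \mid s, (\kappa_p, \kappa_{-p})] &= \sum_{x \in X_s} \Pr(x \mid s)\, M_q^s(x) \\
&= \sum_{x_{-p}} \sum_{x_p \in X_u} \Pr(x_{-p} \mid s, \kappa_{-p})\, \Pr(x_p \mid s, \kappa_p)\, M_q^s(x) \\
&= \sum_{x_{-p}} \Pr(x_{-p} \mid s, \kappa_{-p}) \sum_{x_p \in X_u} \Pr(x_p \mid s, \kappa_p)\, \big(\bar{m}_s(x_{-p}) \cdot x_p\big) \\
&= \sum_{x_{-p}} \Pr(x_{-p} \mid s, \kappa_{-p}) \Big( \bar{m}_s(x_{-p}) \cdot \sum_{x_p \in X_u} \Pr(x_p \mid s, \kappa_p)\, x_p \Big) \\
&= \sum_{x_{-p}} \Pr(x_{-p} \mid s, \kappa_{-p})\, \big( \bar{m}_s(x_{-p}) \cdot \mathrm{E}[x_p \mid s, \kappa_p] \big).
\end{aligned}$$

Since $\mathrm{E}[x_p \mid s, \kappa_p] = \mathrm{E}[x_p \mid s, \kappa'_p]$ at the relevant states, it follows that

$$\mathrm{E}[M_q^s(x_s) \mid s, (\kappa_p, \kappa_{-p})] = \mathrm{E}[M_q^s(x_s) \mid s, (\kappa'_p, \kappa_{-p})],$$

and so by Lemma (4.2.2), we conclude $\kappa_p$ and $\kappa'_p$ are payoff equivalent. □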
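
A small numeric example may help show why only the expected action matters. The sketch below assumes a hypothetical action set $X_u = [0,1]^2$ with a payoff that is linear in player $p$'s action, $M(x_p) = \bar{m} \cdot x_p$, for some fixed opponent action; the two distributions `diag` and `anti` and the vector `m` are made up for illustration. The two distributions put weight on different corners but share the same mean, so they are two explicit behaviors corresponding to the same implicit behavior.

```python
def expected_payoff(m, dist):
    """E[m . x] under a distribution given as (action, probability) pairs."""
    return sum(
        prob * sum(mi * xi for mi, xi in zip(m, x)) for x, prob in dist
    )


def mean_action(dist):
    """E[x] under a distribution given as (action, probability) pairs."""
    dim = len(dist[0][0])
    return tuple(sum(prob * x[i] for x, prob in dist) for i in range(dim))


# Two different distributions over corners of the unit square.
diag = [((0.0, 0.0), 0.5), ((1.0, 1.0), 0.5)]
anti = [((0.0, 1.0), 0.5), ((1.0, 0.0), 0.5)]

# Hypothetical payoff vector for one fixed action of the other players.
m = (3.0, -2.0)

if __name__ == "__main__":
    print(mean_action(diag), mean_action(anti))
    print(expected_payoff(m, diag), expected_payoff(m, anti))
```

Both distributions have mean action (0.5, 0.5) and therefore the same expected payoff, which is the content of the last line of the derivation above.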


Sufficient  Recall 

We say a CEFG has sufficient recall if for all players $p$ it has both:

•  observation memory: For all $u \in U_p$, and all $s, s' \in u$, $\mathrm{obs}_p(s) = \mathrm{obs}_p(s')$. That is,
the information sets for $p$ form a forest.

•  action memory: For any two policies $\kappa_p$ and $\kappa'_p$ for $p$, and any policy $\kappa_{-p}$ for the other
players, and for any $u \in U_p$ with $\Pr(u \mid (\kappa_p, \kappa_{-p})) > 0$ and $\Pr(u \mid (\kappa'_p, \kappa_{-p})) > 0$,
and any $s \in u$, we have

$$\Pr(s \mid u, (\kappa_p, \kappa_{-p})) = \Pr(s \mid u, (\kappa'_p, \kappa_{-p})).$$

It is worth emphasizing that both observation memory and action memory are properties
of the game itself, not of players or policies.

Observation memory says that the current information set uniquely specifies the
sequence of information sets (which we can view as the history of observations) that have
previously occurred; hence player $p$ has no incentive to remember the information sets
visited. Action memory implies that if we know the current information set is $u$, then
remembering the policy we followed up until we reached $u$ provides no information about
the actual $s \in u$. Thus, the player need not remember the policy followed so far. The name
sufficient action memory might be more appropriate, as the exact actions taken at
past information sets are not remembered.

Informally, then, if the game has sufficient recall for player $p$, then player $p$ should be
able to play optimally by selecting an action purely as a function of the current information
set, as from this all relevant past actions and observations can be derived. We use this dual
characterization of sufficient recall because this intuition seems so clear.

However, to formally prove that implicit behavior reactive policies are “strong enough”
to play sufficient-recall CEFGs optimally, we will introduce sequence recall, an alternative
characterization of sufficient recall that makes establishing certain structural lemmas more
natural. Further, sequence recall can be viewed as a generalization of perfect recall as it
is usually defined for EFGs. Before introducing sequence recall, we need to define the
notion of the “outcome” for a player at a state.

Generalized outcomes  In an EFG, all states in an information set $u$ have the same
out-degree $d$, and each outgoing edge from some $s \in u$ is labeled with one of $d$ outcome or
choice labels. Thus, the action set in an EFG is the set of outcome labels. We can view the
choice labels in an EFG as partitioning all of the edges out of $u$ into $d$ different equivalence
classes based on the labels.

In a CEFG, nodes in $u$ may have different out-degrees, and the successor of $s$ is chosen
from a product distribution that is a function of each player's action, using the functions
$f_p^{ss'}$. Thus, we will need a more complex partition. We define a partition on the edges out
of $u \in U_p$ via an equivalence relation $\sim_p$ on pairs of edges. For any two edges $(s, s')$ and
$(t, t')$ out of $u$ (that is, $s, t \in u$), we have $(s, s') \sim_p (t, t')$ if and only if there exists a constant
$\alpha > 0$ such that for all $x \in X_u$

$$f_p^{ss'}(x) = \alpha f_p^{tt'}(x). \tag{4.8}$$

Let $O_u$ be the set of such equivalence classes at $u$ defined by $\sim_p$, so $o \in O_u$ is a maximal
set of edges such that any pair of edges in $o$ satisfies Equation (4.8), and $\bigcup_{o \in O_u} o$ is the set
of all edges out of $u$. In fact, if we “normalize” the CEFG in the manner suggested by the
next lemma, we can assume that $\alpha = 1$ in Equation (4.8) without loss of generality.

Lemma 4.2.6. For any CEFG $G$, there exists an $f$-equivalent CEFG $G'$ such that if
$(s, s') \sim_p (t, t')$ in $G$, then for all $x \in X_u$, in $G'$

$$g_p^{ss'}(x) = g_p^{tt'}(x),$$

where we use $g$ to denote the $f$-functions in $G'$.

Proof. It is sufficient to show the transformation on pairs of edges. Suppose $G$ has edges
$(s, s')$ and $(t, t')$ out of $u$ that fall into the same partition, but the corresponding $f$-functions
are not equal. Then there must exist some $\alpha$ such that

$$f_p^{ss'}(x) = \alpha f_p^{tt'}(x).$$

WLOG, $\alpha \le 1$ (if not, divide both sides by $\alpha$ and take $\alpha' = 1/\alpha$).

Even if $G$ has an “inactive” random player (that is, $f_0^{ss'} = 1$ for all states $s$ and $s'$ in $G$),
$G'$ will have an active one. We write $g_0^{ss'}$ for the constant $f$-functions of the random player
in $G'$. The $f$-functions in $G'$ are the same as in $G$ (in particular, $g_p^{tt'} = f_p^{tt'}$), except we set
$g_p^{ss'}(x) \leftarrow f_p^{tt'}(x)$, and $g_0^{ss'} = \alpha f_0^{ss'}$. Now, in $G'$ the $f$-functions on $(t, t')$ and $(s, s')$ are
identical (satisfy Equation (4.8) with $\alpha = 1$), as we have “moved” the constant difference
$\alpha$ into the randomness on the $(s, s')$ edge. That is, in $G$,

$$\begin{aligned}
\Pr(s' \mid x, s) &= f_0^{ss'} f_1^{ss'}(x_1) \cdots f_p^{ss'}(x_p) \cdots \\
&= (\alpha f_0^{ss'})\, f_1^{ss'}(x_1) \cdots f_p^{tt'}(x_p) \cdots \\
&= g_0^{ss'} g_1^{ss'}(x_1) \cdots g_p^{ss'}(x_p) \cdots,
\end{aligned}$$

which is $\Pr(s' \mid x, s)$ in $G'$, and so transition probabilities in the two games are identical. □

We call CEFGs that have been maximally transformed using Lemma (4.2.6)
$f$-normalized; such CEFGs satisfy

$$(s, s') \sim_p (t, t') \implies f_p^{ss'} = f_p^{tt'}. \tag{4.9}$$

For the remainder of this paper, we assume all CEFGs are $f$-normalized. Under this
assumption, we write $f_p^{u,o}$ for the $f$-function shared by all edges out of $u$ in outcome
partition $o \in O_u$.
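
The outcome partition defined by Equation (4.8) can be computed mechanically when the $f$-functions are available in closed form. The sketch below makes a simplifying assumption that is not in the text: each $f_p^{ss'}$ is linear with a nonzero coefficient vector, $f(x) = c \cdot x$, and $X_u$ is rich enough that two such functions are proportional on $X_u$ exactly when their coefficient vectors are proportional. The edge names, the dictionary `edges`, and the helper names are hypothetical.

```python
from fractions import Fraction


def normalize(coeffs):
    """Scale a nonzero coefficient vector so its absolute entries sum to 1.

    Vectors that differ by a positive constant alpha map to the same tuple;
    vectors that differ by a negative constant do not.
    """
    total = sum(abs(c) for c in coeffs)
    return tuple(Fraction(c) / total for c in coeffs)


def outcome_partition(edges):
    """Group edges whose coefficient vectors differ by a positive constant."""
    classes = {}
    for edge, coeffs in edges.items():
        classes.setdefault(normalize(coeffs), []).append(edge)
    return sorted(sorted(group) for group in classes.values())


# Hypothetical edges out of one information set u = {s, t}.
edges = {
    ("s", "s1"): (Fraction(1, 2), Fraction(1, 4)),
    ("t", "t1"): (Fraction(1), Fraction(1, 2)),  # twice the first: same outcome
    ("s", "s2"): (Fraction(1, 2), Fraction(3, 4)),
    ("t", "t2"): (Fraction(0), Fraction(1, 2)),
}

if __name__ == "__main__":
    for outcome in outcome_partition(edges):
        print(outcome)
```

Exact rational arithmetic is used so that proportional vectors normalize to identical keys; with floating point one would need a tolerance instead.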

Sequence recall  Using this notion of outcome, we can now define the player $p$ sequence
$\sigma_p(s)$ associated with a state $s$. The sequence $\sigma_p(s)$ is the list of player $p$'s information
sets and outcomes on the unique path in $T$ to $s$. Edges from states $s$ where $p \notin \mathcal{A}(s)$ do
not appear in the sequence $\sigma_p(s)$. We write:

$$\sigma_p(s) = \big( (u_1, o_1), (u_2, o_2), \ldots, (u_m, o_m) \big).$$

In general, we can view $\sigma_p(s)$ as a refinement of $\mathrm{obs}_p(s)$: two states $s, s' \in u$ might have
the same observation history, but different sequences. We say a CEFG has sequence recall
for player $p$ if for all $u \in U_p$ and all $s, s' \in u$, $\sigma_p(s) = \sigma_p(s')$. Note that sequence recall
immediately implies observation memory. In fact, sequence recall and sufficient recall are
equivalent. Before proving this result, we establish that action-selection probabilities in
sequence-recall CEFGs satisfy the following structural property:

Lemma 4.2.7. Suppose $G$ is an $f$-normalized CEFG where $\sigma_p(s) = \sigma_p(s')$ for some
$s, s' \in u$, $u \in U_p$. For any policy $\kappa_p$ for player $p$,

$$w(s \mid \kappa_p) = w(s' \mid \kappa_p),$$

and when $w(s \mid \kappa_p) > 0$, for any $x \in X_u$,

$$\Pr(x \mid s, \kappa_p) = \Pr(x \mid s', \kappa_p).$$

[Figure 4.2: An example CEFG. (a) An information set forest. (b) The part of the game tree corresponding to $u_1$ and $u_2$.]


Proof. The proof follows immediately from the definition of $w(s \mid \kappa_p)$ and $\Pr(x \mid s, \kappa_p)$
solely in terms of the $f$-functions on $\mathcal{E}(s)$ (Lemma (4.2.3) and Equation (4.5)).

In particular, recall that the sequence weights and action probabilities are defined in
terms of the one-player games $\mathcal{G}(p, s)$ and $\mathcal{G}(p, s')$. Since $\sigma_p(s) = \sigma_p(s')$, by the definition
of sequence both games pass through the same information sets in the same order, and
(because of sequence recall and the $f$-normalized assumption) the functions $f_p$
that determine if the game continues are also identical. Hence, the games $\mathcal{G}(p, s)$ and
$\mathcal{G}(p, s')$ are equivalent; since we define $\Pr(x \mid s', \kappa_p)$ and $w(s \mid \kappa_p)$ in terms of these
games, the lemma follows. □

Corollary 4.2.8. In an $f$-normalized CEFG with sequence recall, for any policy $\kappa_p$ for
player $p$, any $u \in U_p$, and any $s, s' \in u$ and $x \in X_u$, we have $w(s \mid \kappa_p) = w(s' \mid \kappa_p)$, and
when $w(s \mid \kappa_p) > 0$, $\Pr(x \mid s, \kappa_p) = \Pr(x \mid s', \kappa_p)$.

Corollary (4.2.8) reveals the significant structure of sequence weights in CEFGs with
sequence recall. This structure is basically identical to the structure of the sequence
weights in perfect-recall EFGs, hence justifying our adoption of the term “sequence
weights” for the $w_p(s \mid \kappa_p)$ values. The principal results and associated notation are
given below; they are stated using the relationships of the states and information sets of
Figure 4.2, but hold in general. For CEFGs with sequence recall we extend our notation
for sequence weights and write $w(u \mid \kappa_p) = w(s \mid \kappa_p)$ for any $s \in u$.

•  Each non-root information set $u_2$ for player $p$ has a unique (information set,
outcome) “parent”: if $u_1$ is the unique predecessor of $u_2$ in the information set forest,
then any path from $u_1$ to $u_2$ must begin with an edge in some fixed outcome, say
$o_1$. Let $\mathrm{upred}_p(u_2) = (u_1, o_1)$ identify this parent; if $u$ is a root information set,
we write $\mathrm{upred}_p(u) = \emptyset$. This situation is shown in the information set forest of
Figure 4.2(a).

•  Any state $s$ occurring after some player $p$ information set (that is, with a non-empty
$\sigma_p(s)$) has a unique (information set, outcome) predecessor, namely the last tuple in
$\sigma_p(s)$. We extend the upred notation to this case, so for example $\mathrm{upred}(s_2) = (u_1, o_1)$
(see Figure 4.2(b)). It follows that in fact

$$w(s_2 \mid \kappa_p) = w(\mathrm{upred}(s_2) \mid \kappa_p) = w(u_1, o_1 \mid \kappa_p).$$

•  Any state $s$ occurring before any player $p$ information set has $w(s \mid \kappa_p) = 1$, for
example $w(s_1 \mid \kappa_p) = 1$.

•  Consider the (partial) game tree shown in Figure 4.2(b). We have $u_1 = \{s_1, t_1\}$, and
$u_2 = \{s_2, t_2\}$. State $s_1$ has successor $s'_1$, and $t_1$ has successor $t'_1$. These edges are
in the same outcome partition $o_1$, so $f_p^{s_1 s'_1} = f_p^{t_1 t'_1}$. Thus Corollary (4.2.8) implies
that $w(s'_1 \mid \kappa_p) = w(t'_1 \mid \kappa_p)$. In general, any immediate successor state of $u$ reached
via an edge in a fixed outcome partition $o$ must have the same sequence weight; we
write $w(u, o \mid \kappa_p)$ for this value.

In summary, for any node $s \in u_2$ where $(u_1, o_1) = \mathrm{upred}_p(u_2)$, we write any of the
following equivalently:

$$w_p(\sigma_p(s) \mid \kappa_p) = w_p(s \mid \kappa_p) = w_p(u_2 \mid \kappa_p) = w_p(u_1, o_1 \mid \kappa_p). \tag{4.10}$$

Sequence  recall  equals  sufficient  recall  We  now  turn  to  proving  that  sequence  recall 
and  sufficient  recall  are  equivalent.  We  will  need  the  following  two  Lemmas: 

Lemma 4.2.9. Suppose that $e_1 = (s, s')$ and $e_2 = (t, t')$ with $s, t \in u$ are in different
outcomes for player $p$, that is, $(s, s') \not\sim_p (t, t')$. Then, there exist $x_1, x_2 \in X_u$ such that

$$\frac{f_p^{ss'}(x_1)}{f_p^{tt'}(x_1)} \neq \frac{f_p^{ss'}(x_2)}{f_p^{tt'}(x_2)}.$$

Proof. If $e_1$ and $e_2$ are in different outcomes, then for any constant $\alpha > 0$, there exists an
$x \in X_u$ such that

$$f_p^{ss'}(x) \neq \alpha f_p^{tt'}(x).$$

Fix $\alpha = 1$, and let $x_1 \in X_u$ be such that $f_p^{ss'}(x_1) \neq \alpha f_p^{tt'}(x_1)$. Now, use $\beta = f_p^{ss'}(x_1) / f_p^{tt'}(x_1) \neq 1$
as the constant, and let $x_2$ be such that

$$f_p^{ss'}(x_2) \neq \beta f_p^{tt'}(x_2). \tag{4.11}$$

Dividing both sides of Equation (4.11) by $f_p^{tt'}(x_2)$ and using the definition of $\beta$ yields the
lemma. □


The  next  lemma  shows  that  the  ratio  Pr(s)/Pr(s')  for  s,s'  G  u  does  not  depend  on 
player  p’s  policy. 


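Lemma 4.2.10. A CEFG has action memory for player $p$ if and only if for all $u \in U_p$,
any two policies $\kappa_p$ and $\kappa'_p$ for $p$, and any joint policy $\kappa_{-p}$ for the other players, for any
$s, s' \in u$ with $s' \in \mathrm{REL}(\kappa_p, \kappa_{-p})$,

$$\frac{\Pr(s \mid (\kappa_p, \kappa_{-p}))}{\Pr(s' \mid (\kappa_p, \kappa_{-p}))} = \frac{\Pr(s \mid (\kappa'_p, \kappa_{-p}))}{\Pr(s' \mid (\kappa'_p, \kappa_{-p}))}. \tag{4.12}$$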


Proof. Let $a_i = \Pr(i \mid (\kappa_p, \kappa_{-p}))$ for $i \in u$, and let $b_i = \Pr(i \mid (\kappa'_p, \kappa_{-p}))$ for $i \in u$.
Observe that $\Pr(u \mid (\kappa_p, \kappa_{-p})) = \sum_i a_i$, and similarly for $b_i$.

Fix $s, s' \in u$ with $s' \in \mathrm{REL}(\kappa_p, \kappa_{-p})$. If $b_{s'} = \Pr(s' \mid (\kappa'_p, \kappa_{-p})) = 0$, then
Equation (4.12) fails to hold because $s' \in \mathrm{REL}(\kappa_p, \kappa_{-p})$ implies $\Pr(s' \mid (\kappa_p, \kappa_{-p})) > 0$; it is also
easy to show that action memory does not hold in this case, and so the lemma holds. We
now consider the case where $b_{s'} > 0$. Action memory is exactly the condition that

$$\frac{a_s}{\sum_i a_i} = \frac{b_s}{\sum_i b_i} \tag{4.13}$$

for all $s \in u$. Inverting both sides of Equation (4.13) for $s'$ we have $(\sum_i a_i)/a_{s'} =
(\sum_i b_i)/b_{s'}$. Multiplying the left-hand side of this equality with the left-hand side of
Equation (4.13), and similarly the right with the right, gives

$$\frac{a_s}{a_{s'}} = \frac{b_s}{b_{s'}},$$

which is exactly the claimed equality.

□
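
The relationship between the conditional form (4.13) and the ratio form (4.12) is easy to check on concrete numbers. The sketch below uses made-up reach probabilities for an information set with two states; the dictionaries `a`, `b`, `c` and the helper names are hypothetical. Exact fractions are used so that equality tests are not affected by rounding.

```python
from fractions import Fraction


def conditional(probs):
    """Pr(s | u) for each s in u, given the reach probabilities Pr(s)."""
    total = sum(probs.values())
    return {s: p / total for s, p in probs.items()}


def ratios(probs, ref):
    """Pr(s) / Pr(ref) for each s in u; requires Pr(ref) > 0."""
    return {s: p / probs[ref] for s, p in probs.items()}


# Reach probabilities of the states in u under three hypothetical policies
# for player p (with the other players' policies held fixed).
a = {"s": Fraction(1, 10), "s'": Fraction(3, 10)}
b = {"s": Fraction(1, 20), "s'": Fraction(3, 20)}  # same proportions as a
c = {"s": Fraction(1, 10), "s'": Fraction(1, 10)}  # different proportions

if __name__ == "__main__":
    print(conditional(a) == conditional(b), ratios(a, "s'") == ratios(b, "s'"))
    print(conditional(a) == conditional(c), ratios(a, "s'") == ratios(c, "s'"))
```

Policies `a` and `b` reach $u$ with different total probability but induce the same conditional distribution over its states, so both forms hold; policy `c` violates both.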


Now,  we  can  prove  the  main  theorem: 

Theorem 4.2.11. A CEFG has sufficient recall for player $p$ if and only if it has sequence
recall for player $p$.

Proof. For the first direction, assume the game has sequence recall. Then, for any $s, s' \in u$
with $s' \in \mathrm{REL}(\kappa_p, \kappa_{-p})$, and any policy $\kappa_p$ for player $p$, and any joint history policy $\kappa_{-p}$
for the other players, we have

\[
\frac{\Pr(s \mid (\kappa_p, \kappa_{-p}))}{\Pr(s' \mid (\kappa_p, \kappa_{-p}))}
  = \frac{w_p(s \mid \kappa_p)\, w(s \mid \kappa_{-p})}{w_p(s' \mid \kappa_p)\, w(s' \mid \kappa_{-p})}
  = \frac{w(s \mid \kappa_{-p})}{w(s' \mid \kappa_{-p})}
\]

since $w_p(s \mid \kappa_p) = w_p(s' \mid \kappa_p)$ by Corollary (4.2.8). Since the right-hand side of the equality
does not depend on $\kappa_p$, we conclude Equation (4.12) holds, and so by Lemma (4.2.10),
we have action memory. Observation memory is immediate from sequence recall.

For the other direction, assume the game has sufficient recall for player $p$. Assume
for contradiction there exists an information set $u_2$ where sequence recall does not hold:
there exist $s_2, t_2 \in u_2$ such that $\sigma_p(s_2) \neq \sigma_p(t_2)$. Both $s_2$ and $t_2$ share a predecessor
information set $u_1$ (observation recall holds, and the observation history cannot be empty
or their sequences would agree). Let $s_1$ be the state in $u_1$ on the path to $s_2$, and let $s'_1$ be
$s_1$’s successor on the path (possibly $s'_1 = s_2$), so $(s_1, s'_1)$ is an edge. Similarly identify an
edge $(t_1, t'_1)$ out of $u_1$ on the path to $t_2$. Without loss of generality, assume $\sigma_p(s_1) = \sigma_p(t_1)$
(if this doesn’t hold immediately, let $s_2 \leftarrow s_1$ and $t_2 \leftarrow t_1$, and repeat this process
as needed). This situation is shown in Figure (4.2b).

Since $\sigma_p(s_1) = \sigma_p(t_1)$, but $\sigma_p(s_2) \neq \sigma_p(t_2)$, these two sequences must differ on
the last outcome (the outcome from $u_1$), that is, $(s_1, s'_1) \not\sim_p (t_1, t'_1)$. By Lemma (4.2.9) there
exist $x_1, x_2 \in X_{u_1}$ such that

\[
\frac{f_p^{s_1, s'_1}(x_1)}{f_p^{t_1, t'_1}(x_1)} \neq \frac{f_p^{s_1, s'_1}(x_2)}{f_p^{t_1, t'_1}(x_2)}.
\tag{4.14}
\]

Let $\pi_1$ be any pure reactive policy with $w_p(s_1 \mid \pi_1) > 0$ and $\pi_1(u_1) = x_1$, and let $\pi_2$ be
the same as $\pi_1$, except that $\pi_2(u_1) = x_2$. Then, fix any $\kappa_{-p}$ with $t_2 \in \mathrm{REL}(\kappa_{-p})$, and let
$B = w(s_2 \mid \kappa_{-p}) / w(t_2 \mid \kappa_{-p})$, so

\[
\frac{\Pr(s_2 \mid (\pi_1, \kappa_{-p}))}{\Pr(t_2 \mid (\pi_1, \kappa_{-p}))}
  = \frac{w_p(s_2 \mid \pi_1)\, w(s_2 \mid \kappa_{-p})}{w_p(t_2 \mid \pi_1)\, w(t_2 \mid \kappa_{-p})}
  = B\, \frac{w_p(s_1 \mid \pi_1)\, f_p^{s_1, s'_1}(x_1)}{w_p(t_1 \mid \pi_1)\, f_p^{t_1, t'_1}(x_1)}
  = B\, \frac{f_p^{s_1, s'_1}(x_1)}{f_p^{t_1, t'_1}(x_1)}
\]

since $\sigma_p(s_1) = \sigma_p(t_1)$ and so by Lemma (4.2.7) $w_p(s_1 \mid \pi_1) = w_p(t_1 \mid \pi_1)$. By an analogous
argument,

\[
\frac{\Pr(s_2 \mid (\pi_2, \kappa_{-p}))}{\Pr(t_2 \mid (\pi_2, \kappa_{-p}))}
  = B\, \frac{f_p^{s_1, s'_1}(x_2)}{f_p^{t_1, t'_1}(x_2)},
\tag{4.15}
\]

and so Equation (4.14) implies

\[
\frac{\Pr(s_2 \mid (\pi_1, \kappa_{-p}))}{\Pr(t_2 \mid (\pi_1, \kappa_{-p}))}
  \neq \frac{\Pr(s_2 \mid (\pi_2, \kappa_{-p}))}{\Pr(t_2 \mid (\pi_2, \kappa_{-p}))}.
\]

Thus, by Lemma (4.2.10) action memory does not hold at $u_2$, contradicting the assumption
that the game has sufficient recall for $p$, and so we conclude sequence recall holds for all
player $p$ information sets. □


Based on Theorem (4.2.11), we can apply the notation from Equation (4.10) to
sufficient-recall games.


A payoff equivalence theorem. Now we can give this section’s principal result: implicit
behavior reactive policies are payoff equivalent to general policies in sufficient-recall
CEFGs. This is critical, as our optimization technique will let us find the best implicit
behavior reactive policy.

Theorem 4.2.12. For sufficient-recall CEFGs, for any policy $\kappa_p$ for player $p$, there exists
a payoff equivalent implicit behavior reactive policy.

Proof. Let $\kappa_p$ be an arbitrary policy for $p$. A consequence of Lemma (4.2.7) is that for all
$s, s' \in u$, when $s, s' \in \mathrm{REL}(\kappa_p)$,

\[
\mathbf{E}[x_p \mid s, \kappa_p] = \mathbf{E}[x_p \mid s', \kappa_p].
\]

Call this value $x_u$ for each $u$ where it is defined (i.e., where $\exists s \in u$ such that $s \in \mathrm{REL}(\kappa_p)$),
and pick $x_u$ arbitrarily in $X_u$ for the remaining $u \in U_p$. Then, we define an implicit
behavior policy by $\beta(u) = x_u$. These two policies must play the same action in
expectation at any state $s$ where $w(s \mid \kappa_p) > 0$ and $w(s \mid \beta) > 0$. Thus, by Lemma (4.2.5)
they are payoff equivalent. □
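The construction in this proof can be illustrated for the special case where the general policy is a mixture over pure reactive policies. The following is a minimal sketch under several assumptions that are not in the text: the two-stage structure (`u1`, then `u2` after outcome `L`), the names `PURE_POLICIES` and `MIXTURE`, and the simplification that conditioning on the player's own sequence weight is enough (the other players' weights do not depend on which mixture component is in use, so they cancel).

```python
from fractions import Fraction

# Hypothetical one-player structure: information set "u1" offers outcomes (L, R);
# "u2" is reached only after L and offers outcomes (l, r).
# A pure policy maps each information set to a one-hot action vector.
PURE_POLICIES = [
    {"u1": (1, 0), "u2": (1, 0)},  # play L, then l
    {"u1": (0, 1), "u2": (0, 1)},  # play R (so u2 is never reached)
]
MIXTURE = [Fraction(1, 2), Fraction(1, 2)]  # probability of each pure policy


def reach_weight(policy, info_set):
    """The player's own sequence weight for reaching info_set under a pure policy."""
    return 1 if info_set == "u1" else policy["u1"][0]


def implicit_behavior(info_set):
    """Expected action at info_set, conditioned on the player's own play reaching it."""
    weights = [lam * reach_weight(pi, info_set)
               for lam, pi in zip(MIXTURE, PURE_POLICIES)]
    total = sum(weights)
    if total == 0:
        return None  # unreachable: the proof picks an arbitrary action here
    dim = len(PURE_POLICIES[0][info_set])
    return tuple(
        sum(w * pi[info_set][i] for w, pi in zip(weights, PURE_POLICIES)) / total
        for i in range(dim)
    )


beta = {u: implicit_behavior(u) for u in ("u1", "u2")}

# Weight of the sequence (L, l) under the mixture and under the behavior policy.
mixture_weight = sum(lam * pi["u1"][0] * pi["u2"][0]
                     for lam, pi in zip(MIXTURE, PURE_POLICIES))
behavior_weight = beta["u1"][0] * beta["u2"][0]
```

In this instance the behavior policy plays `L` and `R` with equal probability at `u1` and always plays `l` at `u2`, and the two sequence weights coincide, which is what payoff equivalence requires on this path.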

Theorem (4.2.12) shows that when playing sufficient-recall CEFGs, it suffices to consider
only implicit behavior reactive policies. In the next section we show that for two-player
zero-sum sufficient-recall CEFGs, the set of IBRPs for each player can be represented
as a convex set $W$ in such a way that the value of the game is multi-linear in $W$.
Thus, we can solve zero-sum sufficient-recall CEFGs using linear programming on the
convex game defined by the sets $W$ and corresponding multi-linear objective function.

4.3 Solving a CEFG by Transformation to a Convex Game


We consider a zero-sum CEFG with two players, $x$ and $y$, and possibly a random player
0. To differentiate the two players, we use $u$ and $X_u$ to denote player $x$’s information
and action sets, and similarly $v$ and $Y_v$ for $y$. In the two-player, zero-sum case the costs
at a node $s$ where both players play are specified via a payoff matrix $M^s$ of dimension
$n_u \times n_v$; the payoff from $x$ to $y$ is then $x^{\top} M^s y$ when $x$ plays $x$ and $y$ plays $y$.
Since the action set of a player who is not active at $s$ is $\{1\}$, we use this same notation to
indicate payoffs where only one player selects an action; in this case the payoff is the dot
product of a cost vector with the action of the active player. The random player, if present,
does not affect payoffs directly, and so this notation still applies in the presence of a random
player. In fact, the random player only affects the game through her sequence weights,
which we write as $w_0(s)$, because the random player has no policy.

An IBRP for player $x$ can be viewed as a vector from the convex set

\[
X = \bigotimes_{u \in U_x} X_u.
\]

The set $X$ is a Cartesian product of convex sets, and so it is also a convex set. Define
$Y$ analogously for $y$, and let $\beta_x \in X$ and $\beta_y \in Y$ be two IBRPs. Let $u = \phi_x(s)$ and
$v = \phi_y(s)$, and define

\[
\begin{aligned}
V(s) &= \Pr(s \mid (\beta_x, \beta_y))\; \mathbf{E}[M^s(x_s) \mid s, (\beta_x, \beta_y)] \\
     &= w_0(s)\, w(s \mid \beta_x)\, w(s \mid \beta_y)\; \beta_x(u)^{\top} M^s \beta_y(v)
\end{aligned}
\tag{4.16}
\]

using Lemma (4.2.4). The expected payoff from $x$ to $y$ is

\[
V = \sum_{s \in \mathrm{REL}(\beta_x, \beta_y)} V(s)
\]

by Lemma (4.2.2). Unfortunately, $V(s)$ is not bilinear in $\beta_x$ and $\beta_y$, as $w(s \mid \beta_x)$ is a
product of $f_x(\beta_x(\phi_x(s')))$ terms over the states $s'$ along the path to $s$, and each of these
terms is a linear function of $\beta_x$. Further, each term $V(s)$ contains both $w(s \mid \beta_x)$ and
$\beta_x(\phi_x(s))$, so even if $w(s \mid \beta_x)$ were linear, $\beta_x(\phi_x(s))\, w(s \mid \beta_x)$ would not be.
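A small worked instance makes the nonlinearity concrete. It assumes, purely for illustration, a path on which $x$ acts at two information sets $u_1$ and then $u_2$ before reaching $s$, and outcome functions that simply read off one coordinate of the chosen action (as in an ordinary EFG); the coordinate labels $a$ and $b$ are hypothetical.

```latex
% Sequence weight on a hypothetical two-step path: a product of two coordinates.
\[
  w(s \mid \beta_x) = \beta_x(u_1)_a \,\beta_x(u_2)_b .
\]
% Let \beta' have \beta'(u_1)_a = \beta'(u_2)_b = 1, and let \beta'' have both
% coordinates equal to 0. For the midpoint \beta = \tfrac12\beta' + \tfrac12\beta'':
\[
  w(s \mid \beta) = \tfrac12 \cdot \tfrac12 = \tfrac14
  \;\neq\;
  \tfrac12\, w(s \mid \beta') + \tfrac12\, w(s \mid \beta'') = \tfrac12 .
\]
```

A linear function would take the value $\tfrac12$ at the midpoint, so the sequence weight, and hence $V(s)$, cannot be linear in the behavior-policy coordinates.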

We now develop an alternative convex representation for IBRPs in which $V(s)$ is bilinear.
Our use of sequence weights as variables is analogous to the technique in Koller et al.
[1994], but our approach must also represent the implicit behavior taken at each $X_u$, as
this is not defined by the sequence weights alone. More precisely, we construct a set $W_x$
such that there is a (nonlinear) bijection between $W_x$ and $X$, so each vector $\omega_x \in W_x$
has a natural interpretation as an IBRP. Further, the value $V$ of the game is linear in $\omega_x$ for
a fixed policy for $y$.

The sequence form of CEFGs. We describe the policy representation set $W_x$ for player
$x$; it is analogous for $y$. Our construction of the set $W_x$ relies on the sets

\[
X^c_u = \{(\alpha x, \alpha) \mid x \in X_u,\ \alpha \geq 0\} \subset \mathbb{R}^{n_u + 1}
\]

for each $u \in U_x$. The set $X^c_u$ is the cone extension of $X_u$, and it is also convex [see Boyd
and Vandenberghe, 2004, Sec. 2.1.5]; in fact, if $X_u$ is a polyhedron (defined by a finite
number of linear equalities and inequalities), then so is $X^c_u$; see Appendix (B). We will
treat elements of $X^c_u$ as tuples, writing $(x^c_u, \alpha) \in X^c_u$ where $x^c_u \in \mathbb{R}^{n_u}$ and $\alpha \in \mathbb{R}$.

Define

\[
X^c = \bigotimes_{u \in U_x} X^c_u.
\]

We will have $W_x \subset X^c$. We work with a vector $\omega_x \in X^c$ by writing $\omega_x = ((x^c_u, w_u) \mid u \in U_x)$,
where the $x^c_u \in \mathbb{R}^{n_u}$ and $w_u \in \mathbb{R}$ variables are defined for all $u \in U_x$ by $\omega_x$.

The set $W_x$ is defined by the following constraints:

\[
\begin{aligned}
(x^c_u, w_u) &\in X^c_u && \forall u \in U_x && (4.17) \\
w_u &= 1 && \forall u \in U_x \text{ with } \mathrm{upred}_p(u) = \emptyset && (4.18) \\
w_u &= f_x^{u', o'} \cdot x^c_{u'} && \forall u \in U_x \text{ with } \mathrm{upred}_p(u) = (u', o'). && (4.19)
\end{aligned}
\]

We write $f_x^{u', o'} \cdot x^c_{u'}$ to emphasize the linearity of the $f$ functions. The set $W_x$
is convex as $X^c$ is convex and the constraints are linear.
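To make Constraints (4.17) to (4.19) concrete, the sketch below builds a sequence-form vector from a behavior policy on a tiny hypothetical structure and checks the constraints. Everything named here (`UPRED`, `ORDER`, the two information sets, and the choice of simplex action sets with coordinate-selecting outcome functions) is an assumption made for illustration, not the general CEFG machinery.

```python
from fractions import Fraction

# Hypothetical player-x structure: "u1" is a root information set, and "u2" is
# reached after outcome 0 at "u1". UPRED maps u to (predecessor, outcome) or None.
UPRED = {"u1": None, "u2": ("u1", 0)}
ORDER = ["u1", "u2"]  # parents listed before children


def f(outcome, action):
    """Assumed linear outcome function: reads off one coordinate of the action."""
    return action[outcome]


def to_sequence_form(beta):
    """Map a behavior policy to omega = {u: (x_c, w)} with x_c = w * beta(u)."""
    omega = {}
    for u in ORDER:
        if UPRED[u] is None:
            w = Fraction(1)                      # Constraint (4.18)
        else:
            parent, outcome = UPRED[u]
            w = f(outcome, omega[parent][0])     # Constraint (4.19)
        omega[u] = (tuple(w * p for p in beta[u]), w)
    return omega


def satisfies_constraints(omega):
    """Check (4.17)-(4.19), taking each X_u to be a probability simplex."""
    for u, (x_c, w) in omega.items():
        # Cone extension of a simplex: nonnegative entries that sum to w.
        if w < 0 or min(x_c) < 0 or sum(x_c) != w:
            return False
        if UPRED[u] is None:
            if w != 1:
                return False
        else:
            parent, outcome = UPRED[u]
            if w != f(outcome, omega[parent][0]):
                return False
    return True


beta = {"u1": (Fraction(1, 4), Fraction(3, 4)),
        "u2": (Fraction(2, 3), Fraction(1, 3))}
omega = to_sequence_form(beta)

# Tampering with a weight breaks the cone and predecessor constraints.
bad_omega = dict(omega)
bad_omega["u2"] = (omega["u2"][0], Fraction(1, 2))
```

Here `omega["u2"]` carries weight 1/4, the probability that the player's own play reaches `u2`, and its action part is the behavior action scaled by that weight.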


First, we show $W_x$ is in 1-1 correspondence with a set of IBRPs (represented
as elements in $X$). We do not consider the full set $X$ for technical reasons: a behavior
policy $\beta \in X$ can be “over-specified,” in that $\beta$ defines an action $\beta(u) \in X_u$ even when
$w(u \mid \beta) = 0$ (and hence $u$ cannot possibly be reached when playing $\beta$). For each $u \in U_x$,
pick an arbitrary action $\bar{x}_u \in X_u$. We define the function $J : X \to X$, which we use to
specify a canonical representation of behavior policies. Define

\[
J(\beta)(u) =
\begin{cases}
\beta(u) & \text{when } w(u \mid \beta) > 0 \\
\bar{x}_u & \text{otherwise}
\end{cases}
\]

so that $J(X) = \{J(\beta) \mid \beta \in X\}$. For any $\beta \in X$, the policies $\beta$ and $J(\beta)$ play the same
action at all information states possibly reached, and so must be payoff equivalent. Hence,
optimizing over $J(X)$ is equivalent to optimizing over $X$.

We now show a bijection $g$ between $W_x$ and $J(X)$, defined by $g(\omega_x) = \beta_{\omega_x}$, where
$\beta_{\omega_x} \in X$ is the IBRP defined by

\[
\beta_{\omega_x}(u) =
\begin{cases}
(1/w_u)\, x^c_u & \text{when } w_u > 0 \\
\bar{x}_u & \text{otherwise.}
\end{cases}
\]

Since $(x^c_u, w_u) \in X^c_u$, it follows from the definition of cone extension that $(1/w_u)\, x^c_u \in X_u$,
and so $\beta_{\omega_x}$ is a valid IBRP and $g$ is well-defined. Next, we prove $g$ is a bijection:

Theorem 4.3.1. The function $g$ is a bijection between $W_x$ and $J(X)$.

Proof. We need to show $g$ is 1-1 and onto.

To show $g$ is onto $J(X)$, consider an arbitrary $\beta \in J(X)$. Define an $\omega_x$ by $w_u = w(u \mid \beta)$
and $x^c_u = w(u \mid \beta)\, \beta(u)$. It follows from the definition of $g$ and $J$ that $g(\omega_x) = \beta$. It
remains to show $\omega_x \in W_x$. First, it is straightforward to verify that Constraints (4.17) and (4.18)
are satisfied. For Constraint (4.19), let $(u', o') = \mathrm{upred}_p(u)$, and observe that

\[
w_u = w(u \mid \beta) = w(u', o' \mid \beta) = f_x^{u', o'} \cdot x^c_{u'}.
\]

For 1-1, suppose $\omega_x = ((x^c_u, w_u))$ and $\omega'_x = ((y^c_u, v_u))$ are in $W_x$, such that $\omega_x \neq \omega'_x$, but
$g(\omega_x) = g(\omega'_x)$. Let $\beta = g(\omega_x)$ and $\beta' = g(\omega'_x)$. WLOG, let $u$ be an information set where
$(x^c_u, w_u) \neq (y^c_u, v_u)$, but for all earlier information sets $\omega_x$ and $\omega'_x$ agree. Then, $w_u$ and $v_u$
must be equal by Constraint (4.19), and so $x^c_u$ and $y^c_u$ must differ. However, this implies $\beta$
and $\beta'$ must play differently at $u$, a contradiction. □
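The role of the canonical set $J(X)$ shows up clearly in a round trip through $g$. The sketch below is a hypothetical two-information-set chain (`u1`, then `u2` after outcome 0), with `DEFAULT` standing in for the arbitrary fixed actions $\bar{x}_u$; the names and the structure are assumptions for illustration only.

```python
from fractions import Fraction

# Arbitrary fixed actions used at information sets with zero weight.
DEFAULT = {"u1": (Fraction(1), Fraction(0)), "u2": (Fraction(1), Fraction(0))}


def g(omega):
    """beta(u) = x_c / w when w > 0, otherwise the fixed default action."""
    return {u: tuple(z / w for z in x_c) if w > 0 else DEFAULT[u]
            for u, (x_c, w) in omega.items()}


def g_inverse(beta):
    """Sequence-form vector for a policy on the chain u1 -(outcome 0)-> u2."""
    w1 = Fraction(1)
    w2 = w1 * beta["u1"][0]
    return {"u1": (tuple(w1 * p for p in beta["u1"]), w1),
            "u2": (tuple(w2 * p for p in beta["u2"]), w2)}


# A policy that reaches u2 with positive weight.
reaching = {"u1": (Fraction(1, 4), Fraction(3, 4)),
            "u2": (Fraction(2, 3), Fraction(1, 3))}
# A canonical policy that never reaches u2 (it plays the default there).
canonical = {"u1": (Fraction(0), Fraction(1)), "u2": DEFAULT["u2"]}
# An over-specified policy: same play at u1, but a non-default action at u2.
over_specified = {"u1": (Fraction(0), Fraction(1)),
                  "u2": (Fraction(1, 2), Fraction(1, 2))}

round_trip_reaching = g(g_inverse(reaching))
round_trip_canonical = g(g_inverse(canonical))
round_trip_over_specified = g(g_inverse(over_specified))
```

The first two policies survive the round trip unchanged, while the over-specified one collapses to its canonical form, which is why the bijection is stated between $W_x$ and $J(X)$ rather than $W_x$ and $X$.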


When we refer to elements of $W_x$ as policies, we mean the corresponding IBRP given
by the bijection $g$. Now, we show that payoffs are bilinear in the $W_x$ representation.

Theorem 4.3.2. In a two-player, zero-sum, sufficient-recall CEFG, represent $x$’s IBRPs
as $W_x$, and player $y$’s IBRPs as $W_y$. Then, for any $\omega_x \in W_x$ and $\omega_y \in W_y$, the payoff
$V(g(\omega_x), g(\omega_y))$ is a bilinear function of $\omega_x$ and $\omega_y$.

Proof. Equation (4.16) shows the payoff is a sum over states that are reached with positive
probability under $\omega_x$ and $\omega_y$. It is sufficient to show that the payoff term for each state is
bilinear.

Let $\omega_x = ((x^c_u, w_u))$ and $\omega_y = ((y^c_v, q_v))$, and let $\beta_x$ and $\beta_y$ be the corresponding IBRPs.
The exact representation depends on which players are active. First, consider the case
where both $x$ and $y$ are active at $s$, say $u = \phi_x(s)$ and $v = \phi_y(s)$. Then, we have

\[
\begin{aligned}
V(s) &= w_0(s)\, w(s \mid \beta_x)\, w(s \mid \beta_y)\; \beta_x(u)^{\top} M^s \beta_y(v) \\
     &= w_0(s)\, \bigl(w(u \mid \beta_x)\, \beta_x(u)\bigr)^{\top} M^s\, w(v \mid \beta_y)\, \beta_y(v) \\
     &= w_0(s)\, (x^c_u)^{\top} M^s y^c_v,
\end{aligned}
\]

and so the payoff is bilinear. The case where only one player, say $x$, is active at $s$ is
similar. Let $u = \phi_x(s)$, and let $(v, o) = \mathrm{upred}_y(s)$. For this case, it is useful to define
$w_{v,o} = f_y^{v,o} \cdot y^c_v$. Note that $w_{v,o} = w(v, o \mid \beta_y)$. Then, writing $y$’s trivial action at $s$ as 1, we have

\[
\begin{aligned}
V(s) &= w_0(s)\, w(s \mid \beta_x)\, w(s \mid \beta_y)\; \beta_x(u)^{\top} M^s \cdot 1 \\
     &= w_0(s)\, w(u \mid \beta_x)\, w(v, o \mid \beta_y)\; \beta_x(u)^{\top} M^s \cdot 1 \\
     &= w_0(s)\, (x^c_u)^{\top} M^s\, w_{v,o},
\end{aligned}
\]

and again the payoff is bilinear. The case where only $y$ is active is analogous, and the case
where neither player is active (e.g., leaf nodes) is a simple extension. □


4.4  Applications  of  CEFGs 

In this section, we give high-level descriptions of how a variety of problems can be
modeled as CEFGs, and note that modeling these problems as standard stochastic games or
EFGs would require at least an exponential blow-up in representation size.


4.4.1  Stochastic  Games  and  POSGs 

We have demonstrated that a CEFG can be represented as a bilinear-payoff convex game,
and so we can use such games as the stage games of a convex stochastic game. In
Section (3.5) we discussed using EFGs in this manner. This approach is quite powerful, but
the time to compute a minimax equilibrium will still in general be exponential in the number
of actions taken between periods of full observability.

In planning applications it is quite common that each player fully observes their own
position, but only has partial observability of the adversary. Further, observations of the
adversary may occur relatively rarely compared with the selection of primitive actions.
In this case, it may be possible to represent the sequence of actions selected between
observations as a single choice from a convex action set, for example the selection of a
(partial) policy in an MDP. To represent such a scenario in a standard EFG, each action
choice in the MDP would be an action in the EFG as well, requiring us to roll out the MDP
until an observation occurs. This EFG would likely be prohibitively large. Using CEFGs,
however, we can use a single node to model the selection of a partial policy that determines
actions up to the next point where an observation might occur by representing the set of
such policies as a convex set. Thus, the depth of the CEFG embedded in the convex game
only depends on the number of potential observations involving the adversary between
periods of full observability, rather than the number of primitive actions (which could be
much larger), producing an exponentially smaller representation.

If  an  observation  can  happen  at  any  time,  this  approach  will  not  work:  the  observation 
model  needs  to  limit  observations  to  occur  only  at  certain  times  (say,  every  15  seconds)  or 
at  certain  designated  states.  This  restricted  observation  model  could  be  the  true  observation 
model,  or  it  could  be  an  approximate  model  designed  to  yield  a  tractable  planning  problem. 
Using an approximate observation model for planning does not limit what observations are
actually used; it only limits the observations for which we can plan. That is, during the
actual execution of a policy, if we get an observation while in the middle of executing
some partial policy we can always re-plan from that point based on the new observation.
However, this approach can make no guarantees about the quality of the solution executed.


4.4.2  Extending  Cost-paired  MDP  Games  with  Observations 

In  Section  3.4,  we  introduced  the  notion  of  a  game  where  one  player  selects  a  policy 
in  an  MDP,  and  the  other  player  selects  a  cost  vector  for  that  MDP.  This  allowed  the 
modeling  of  an  interesting  sensor-placement  problem.  We  also  showed  how  the  model 
can  be  generalized  to  the  case  where  both  players  select  policies  in  an  MDP,  and  the 
total cost of a policy is expressed via a bilinear function of the two players’ state-action
visitation  frequencies.  Representing  this  interesting  convex  game  as  an  EFG,  however, 
requires  using  the  standard  transformation  to  the  normal  form  representation,  which  entails 
an  exponential  blowup  in  the  size  of  the  representation.  This  problem  can,  however,  be 
modeled  as  a  single-node  CEFG  by  simply  embedding  the  convex  game  representation. 
This  is  one  demonstration  of  the  representational  power  of  CEFGs. 

Further,  the  CEFG  representation  makes  it  possible  to  represent  interesting  variations 
on  this  problem  that  cannot  be  represented  as  cost-paired  MDP  games.  In  particular,  we 
can  model  some  observations  of  the  other  player’s  actions  using  a  deeper  game  tree.  The 
details  of  the  observation  formulation  are  important:  generally,  the  size  of  the  CEFG  will 
be  exponential  in  the  number  of  states  in  the  underlying  MDP  where  observations  can  be 
made;  however,  the  CEFG  formulation  lets  us  solve  approximations  where  only  the  most 
important observations are considered. We can trade off computation time and
approximation accuracy by considering more or fewer observation points.

For example, suppose the robot can detect the adversary’s sensors in the observation
avoidance game. Modeling the possibility of making these observations at all states gives
rise to the full (intractable) POSG model (see Section 3.4.2). However, suppose we only
designate a few states where observations are considered, perhaps those states that
correspond to the robot peeking around a corner where a sensor is particularly likely. Then we
can use the CEFG representation to construct a game tree that is exponential in the size of
this small set of “observation states,” but with only polynomial dependence on the size of
the full state space.

This approach can also be applied to approximately solving a generalization of the
adversarial Canadian traveler’s problem.


The adversarial Canadian traveler’s problem. The Canadian traveler’s problem (CTP)
is the problem of computing a shortest path on a graph that is known, except that certain
edges may be impassable; whether an edge is passable or not is only revealed when the
agent reaches an adjacent node. There has been work on both the stochastic version, where
there is a known probability distribution that determines whether an edge is passable or
not, and the adversarial version, where an adversary picks which edges are impassable
(with some restrictions). We generalize the adversarial version by allowing the adversary
to pick an assignment of costs to the edges; an extremely high cost can be used to model an
impassable edge.* The stochastic version of the problem can be formulated as a POMDP,
while the adversarial version is a POSG; even the stochastic version is #P-hard [Bar-Noy
and Schieber, 1991, Papadimitriou and Yannakakis, 1991].

This problem arises naturally in mobile robot path planning, where the uncertainty
over edges in the graph might correspond to uncertainty about whether a door will be
open or closed or a bridge will be up or down. The robot-helicopter coordination problem
of Likhachev et al. [2005] can be formulated as a CTP; the belief space is finite, and
Likhachev et al. solve large instances of this problem by ignoring the POMDP structure
and instead using a clever application of heuristic search to the mostly-deterministic
belief-space MDP. The resulting algorithm is called MCP. Ferguson et al. [2004] give the
PAO* algorithm for “deterministic decision problems with hidden state,” which can easily
be transformed to instances of the Canadian traveler’s problem on a particular graph.
Both Likhachev et al. [2005] and Ferguson et al. [2004] construct a compressed
representation composed only of the states adjacent to edges which may be impassable (and
hence where observations may occur); in PAO* this compressed representation is always fully
constructed, while MCP only constructs the portion relevant to the search from a fixed
start state to a fixed goal. Blei and Kaelbling [1999] also consider the CTP and discuss its
representation as an MDP; they call the problem the “bridge problem.” Lita et al. [2001]
consider a multi-agent version of the CTP.

* While this approach works well in the offline case, for the repeated game (online learning) case bounds
typically depend on the maximum edge cost, and so this approach may force online learning algorithms to
have poor bounds.

Our adversarial-cost generalization of this problem is formulated as follows: player
one needs to get from a fixed start state to a known goal state in a graph. Player two
selects a cost vector (assigning a fixed cost to each edge) from some finite set.† This is
just an MDP with adversary-controlled costs under the assumption that player one doesn’t
observe the costs she incurs. In some domains this may be reasonable, but in others it
might not be; for example, the opponent-chosen costs may correspond to the placement of
obstacles or active interference. For these domains, we can adopt the CTP observation
model, namely that player one can observe the cost of an edge from an adjacent state. If
we allow plans that take all of these possible observations into account, we have the full
CTP. But, suppose that there are only a few edges that can be made arbitrarily expensive:
in a navigation example these might correspond to doors that can be shut, bridges that
can be destroyed, or narrow passes that can be blocked. We can efficiently approximate
this problem by optimizing over the set of policies that only take into account observed
edge costs from states adjacent to potentially expensive edges. This class of plans still has
great power to reason about the fact that the adversary has some control over the costs of
all edges: we simply restrict ourselves from selecting policies that are contingent upon
observing these costs. As with the application of CEFGs to convex stochastic games, if
we use such a limited observation model we can re-plan upon receiving an observation the
original plan did not take into account.


4.4.3  Perturbed  Games  and  Games  with  Outcome  Uncertainty 

Selten [1975] originally introduced perturbed EFGs in his investigation of models of
sequential rationality. He describes how a perturbed EFG is formed from a standard EFG
by introducing a model of “trembles” at each information set: each time a player selects
an action, there is a small probability that a different action is taken instead. It is assumed
that these probabilities are common knowledge. For a modern introduction to different
equilibrium refinements, consult Perea [2002].

We show that the class of perturbed extensive-form games can be compactly represented
as CEFGs, while their EFG representations are exponentially larger. We begin with
the model of Selten, but then extend his model to general outcome uncertainty. This lets us
generalize extensive-form games in much the same way that Markov decision processes
generalize deterministic path planning. The analogy is not perfect, because perturbed
EFGs are still representable as EFGs (but at the cost of an exponential blowup in size),
while a general MDP cannot be modeled by any deterministic planning problem. The
advantage of using CEFGs to represent perturbed EFGs is that we can avoid the exponential
blowup in the size of the representation.

† We can relax this if we further restrict the kinds of observations player one makes.

Figure 4.3: Representing a perturbed EFG as an EFG and as a CEFG.

Fix a standard EFG, and let $A(u)$ be the set of actions at an information set $u$ for
player one. In the unperturbed EFG, the player chooses some action $a \in A(u)$, and the
dynamics of the game ensure that this choice is actually “executed” in the world (whatever
that means for the particular game). A perturbed EFG introduces a level of indirection
between a player’s selection of an action and the execution of that action in the world.
A perturbation function $t_u : A(u) \to \Delta(A(u))$ maps the choice $a$ made by the player
to a distribution $t_u(a)$ over $A(u)$ from which the action that actually occurs is taken. We
write $t_u(a)(a')$ for the probability of action $a'$ actually being executed given that the player
selected action $a$. For example, in constructing trembling-hand equilibria, it is common to
consider perturbation functions

\[
t_{u,\epsilon}(a)(a') =
\begin{cases}
1 - \epsilon(|A_u| - 1) & \text{if } a' = a \\
\epsilon & \text{otherwise}
\end{cases}
\]

where $|A_u|\epsilon$ is some small probability of getting a uniform random action rather than
the chosen action [Selten, 1975]. It is standard to assume that the player at $u$ observes
which $a'$ actually occurred; other players in the game only observe this if they observed
the player’s action at $u$ in the unperturbed game.
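The trembling-hand perturbation function is easy to write down directly. The following sketch is illustrative only: the function name `tremble` and the three-action example are assumptions, and exact fractions are used so that the distribution can be checked to sum to one.

```python
from fractions import Fraction


def tremble(actions, chosen, eps):
    """Distribution over executed actions when `chosen` is selected.

    With probability len(actions) * eps a uniformly random action is executed
    instead of the chosen one, so every other action gets probability eps and
    the chosen action keeps 1 - eps * (len(actions) - 1).
    """
    n = len(actions)
    if chosen not in actions:
        raise ValueError("chosen action must be one of the available actions")
    if not 0 <= eps * n <= 1:
        raise ValueError("len(actions) * eps must be a probability")
    return {a: (1 - eps * (n - 1)) if a == chosen else eps for a in actions}


dist = tremble(["a", "b", "c"], "a", Fraction(1, 100))
```

With three actions and $\epsilon = 1/100$, the chosen action is executed with probability 49/50 and each of the other two with probability 1/100.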

A perturbed EFG is in fact still an extensive-form game. The player selects an action
$a \in A(u)$ as in the original game, but after this choice a new random node is inserted. The
game transitions to this random node, which has successors in 1-1 correspondence with
$A(u)$. The action/successor $a'$ at this node is then chosen according to the distribution
$t_u(a)(a')$, and the game continues as if the player had selected $a'$ at $u$ in the original game.
While we can represent the perturbed game in this way, we have in general doubled the
number of nodes on any root-to-leaf path; of course, doubling the depth of the tree in general
causes an exponential blowup in its size, and hence the size of the EFG representation.
Since algorithms for EFGs are polynomial in this size, we have likely just taken a tractable
problem and made it intractable.

This transformation is shown in parts (a) and (b) of Figure (4.3). Part (a) shows the
original node $s_1$ in the EFG game tree where we will introduce perturbations; there are two
actions from node $s_1$, $a$ and $b$. If action $a$ is taken, the game continues to the subgame $A$,
while if $b$ is taken the game continues to subgame $B$. Note that $A$ and $B$ can be arbitrarily
large trees. For simplicity we do not consider information sets. Part (b) of the figure
then shows the introduction of random nodes $r_1$ and $r_2$ that implement the $\epsilon$ perturbations.
The game tree of (b) thus remembers both which action the player wanted to happen as
well as which action actually happened: hence there are two copies of the subtrees $A$ and
$B$, doubling the size of the EFG. Applying this transformation at every information set
leads to an exponential increase in representation size. Part (c) shows the efficient CEFG
representation, which we discuss below.
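The size blowup can be made concrete by counting nodes in an idealized case. The sketch below assumes a complete game tree with a fixed branching factor in which every internal node is perturbed; the function names are hypothetical and the counting model is a simplification, not the general construction.

```python
def efg_nodes(branching, depth):
    """Nodes in a complete tree: one root plus branching**i nodes at level i."""
    return sum(branching ** i for i in range(depth + 1))


def perturbed_efg_nodes(branching, depth):
    """Nodes after inserting a chance node below every action of every decision node.

    Each decision node now leads to `branching` chance nodes, each of which has
    `branching` successors, so the number of decision nodes (or leaves) at
    original level i grows from branching**i to (branching**2)**i.
    """
    decision_and_leaf = sum((branching ** 2) ** i for i in range(depth + 1))
    chance = sum((branching ** 2) ** i * branching for i in range(depth))
    return decision_and_leaf + chance


def cefg_nodes(branching, depth):
    """The CEFG keeps the original tree; only the action sets change."""
    return efg_nodes(branching, depth)


single_node = perturbed_efg_nodes(2, 1)   # the situation of Figure 4.3(b), with leaf subgames
original_deep = efg_nodes(2, 10)
perturbed_deep = perturbed_efg_nodes(2, 10)
```

For a single binary decision this gives 7 nodes (the decision node, two chance nodes, and four leaves, matching two copies each of the subgames), and at depth 10 the perturbed tree has 2,097,151 nodes against 2,047 in the original.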

We now show that a perturbed EFG has a representation as a CEFG of size polynomial
in the size of the original game and the size of the representation of the functions $t_u$.
Before introducing this representation, we first generalize our notion of perturbed EFGs to
include a complete model of outcome uncertainty.

We generalize Selten’s model by decoupling the set of actions available to the player
from the set of “actions” (perhaps better called outcomes) that may actually be executed
in the world. Formally, we no longer assume $t_u$ maps from the original set of actions to
distributions on this same set. Instead, let $O_u$ represent the outcomes that may occur in
the world, and define some new set $A'(u) = \{p_1, \ldots, p_k\}$ of probabilistic (meta-)actions
for the player. Each action $p_i$ specifies a distribution over possible outcomes, that is,
$p_i \in \Delta(O_u)$. Hence the analogy to MDPs, where an action at a state is defined by the
distribution over successor states it induces. The perturbed game is played as follows:
when $p \in A'(u)$ is selected, the actual outcome that occurs is sampled from $O_u$ according
to the distribution $p$. That is, in this model we have $t_u : A'(u) \to \Delta(O_u)$. But since we
defined each $p \in A'(u)$ as a distribution over $O_u$, the perturbation function for a game
defined in this way is simply the identity function, $t_u(p) = p$ for $p \in A'(u)$.

We now show how to transform an EFG with a perturbation model into a compact
CEFG. To represent an unperturbed EFG as a CEFG, we keep the same game tree, and
replace the finite action set $A(u)$ with the convex action set $\Delta(A(u))$ at each information
set $u$. To represent the perturbed EFG we again keep the same tree structure, but the
set of available actions at $u$ will be the convex set $X_u \subset \Delta(O_u)$ corresponding to those
distributions that are realizable given the choices in $A'(u)$. We treat each $p \in A'(u)$ as a
vector in $\Delta(O_u) \subset \mathbb{R}^{|O_u|}$, and so the set of achievable distributions is

\[
X_u = H(A'(u)),
\]

the convex hull of the set of explicitly allowed distributions. An example of this
representation is shown in part (c) of Figure (4.3). We have $A'(u) = \{a', b'\}$, where $a'$ gives the
distribution $(1 - \epsilon, \epsilon)$ on the outcomes $(A, B)$, while $b'$ gives the distribution $(\epsilon, 1 - \epsilon)$.
In the CEFG representation, there is no need to “remember” in the game tree whether
$A$ occurred because the player chose it explicitly, or because randomness picked it. In
fact, this distinction is not even well-defined in the representation: how would the action
$d = (0.5, 0.5)$ be interpreted?
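The convex-hull action set can be explored numerically. This sketch assumes the two meta-actions of the figure with a made-up value of $\epsilon = 1/10$; the function name `realized_distribution` is hypothetical. It shows that the action $(0.5, 0.5)$ arises as an even mixture of $a'$ and $b'$, so it cannot be attributed to either an explicit choice or a tremble.

```python
from fractions import Fraction


def realized_distribution(meta_actions, mixture):
    """Outcome distribution for a convex combination of meta-actions."""
    if sum(mixture) != 1 or min(mixture) < 0:
        raise ValueError("mixture must be a probability vector")
    n_outcomes = len(meta_actions[0])
    return tuple(sum(m * p[o] for m, p in zip(mixture, meta_actions))
                 for o in range(n_outcomes))


eps = Fraction(1, 10)          # hypothetical tremble probability
a_prime = (1 - eps, eps)       # distribution on outcomes (A, B) when choosing a'
b_prime = (eps, 1 - eps)       # distribution on outcomes (A, B) when choosing b'

even_mix = realized_distribution([a_prime, b_prime],
                                 [Fraction(1, 2), Fraction(1, 2)])
pure_a = realized_distribution([a_prime, b_prime], [Fraction(1), Fraction(0)])
```

Every point of $X_u$ here has the form $(q, 1 - q)$ with $\epsilon \le q \le 1 - \epsilon$, so the set is a line segment whose endpoints are the two meta-actions.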

Of course, in general there is no need to require that $X_u$ is represented as the convex
hull of some finite set of actions/distributions $A'(u)$. We can have $X_u \subset \Delta(O_u)$ be any
complex structured convex set; in particular, $X_u$ can have exponentially many corners
while still having a concise representation. Even if the only representation we have for
$X_u$ is the explicit one, $X_u = H(A'(u))$, the CEFG still gives an exponentially smaller
representation than an EFG. In a CEFG, increasing the size of the set $A'(u)$ does not
change the game tree, and so the corresponding increase in representation size is linear in
the size of the new entries added to $A'(u)$. In an EFG, however, increasing the number of
actions $A(u)$ increases the branching factor of the tree, producing an exponential blowup
in  size.^^ 

The fact that CEFGs concisely represent perturbed EFGs immediately gives a simple
polynomial-time algorithm for finding approximate trembling-hand equilibria (also called
perfect equilibria) for extensive-form games: namely, one simply solves the CEFG version
of the original EFG perturbed by $t_{u,\epsilon}$. Solving for perfect equilibria (or some other form
of sequential equilibria) can be very important in practice, but only very recently have
algorithms for finding such equilibria been investigated [Miltersen and Sorensen, 2006].

We have modeled outcome uncertainty efficiently using CEFGs, but have not fully
tapped  the  class’s  representational  power.  In  particular,  we  have  not  used  the  ability  to 
model  both  players  simultaneously  playing  at  a  single  node,  and  we  have  not  used  the 
ability  to  model  different  numbers  of  outcomes  at  different  states  in  the  same  information 
set.  Both  of  these  abilities  can  potentially  enable  exponentially  smaller  representations.  In 
the  next  section  we  discuss  a  multi-stage  path  planning  problem  where  the  ability  to  have 
both  agents  selecting  actions  simultaneously  is  critical. 

^^The  blowup  is  exponential  if  we  increase  |A(u)|  at  all  u;  if  we  increase  |A(u)|  at  only  a  single  u,  the 
size  of  the  game  tree  increases  multiplicatively. 

[Figure 4.4 node labels. Root node: X_0 = {stochastic policies to intermediate node}, Y_0 = {1st stage cost vectors}, O_0 = {intermediate states: 1, ..., n}. Second-level node 2: X_2 = {1} (planning player not active), Y_2 = Δ({possible goal states: 1, ..., n}), O_2 = {possible goal states}. Third-level node (2, g): X_{2,g} = {stochastic policies from 2 to g}, Y_{2,g} = {2nd stage cost vectors}, O_{2,g} = ∅.]

Figure 4.4: The CEFG game tree for a two-stage path planning game.


4.4.4  Uncertain  Multi-stage  Path  Planning 

We describe a two-stage path planning problem on a graph with adversarial and
stochastic  uncertainty  about  costs  and  goals.  There  are  two  players:  the  planner,  who 
starts  at  some  designated  node  in  the  graph,  and  the  adversary.  In  the  first  stage,  the 
planner  chooses  an  initial  policy  to  follow  (taking  her  to  some  intermediate  node);  the 
adversary  has  some  control  over  costs  on  the  edges  in  the  graph,  in  the  manner  of  an 
MDP  with  adversary-controlled  costs.  In  the  second  stage,  the  actual  destination  node  is 
revealed,  and  the  planning  player  then  selects  a  policy  to  go  from  her  intermediate  state  to 
the  revealed  goal;  the  adversary  again  has  some  control  over  costs. 

A  portion  of  a  CEFG  game  tree  for  this  model  is  given  in  Figure  (4.4).  Information 
sets  are  not  shown  for  simplicity.  The  initial  node  in  the  game  tree  corresponds  to  policy 
selection  and  cost  selection  for  the  first  round.  There  is  one  node  in  the  2nd  level  for  each 
possible  intermediate  state.  Only  the  adversary  is  active  at  the  2nd  level,  where  he  selects 
the  goal  state.  The  final  level  of  the  game  tree  has  one  node  for  each  (intermediate  state, 
goal  state)  pair;  the  figure  only  shows  the  states  corresponding  to  intermediate  state  2.  At 
this  level  both  players  are  again  active:  the  planner  chooses  a  policy  to  follow  from  the 
intermediate  state  to  the  goal  state,  and  the  adversary  chooses  a  cost  vector. 

Using this CEFG representation, we can model the following types of uncertainty:


•  Outcome uncertainty: We view the planner as operating in an MDP rather than
a deterministic path planning problem. Thus, the intermediate node is not deterministically
chosen, but rather will occur according to some distribution induced
by the policy of the planner. The MDP here is the appropriate path-planning MDP
augmented with a “stop-and-wait” action (always available) that indicates that the
planner wants to stop at the current state and wait for the next stage. By introducing
time as a state variable in the MDP we can force the planner to use the stop
action by a certain deadline, or model costs that increase as a function of time so
that an optimal policy will always execute the stop action after some finite amount
of time.

•  Stochastic and adversarial control of costs: In each round, some combination of
randomness and adversarial activity determines the cost associated with each edge
(that is, each (state, action) pair in the MDP). For example, it is possible that first the adversary
selects a probability distribution from some convex set of probability distributions
on costs, and then nature picks the realized costs from that distribution.

•  Stochastic  and  adversarial  control  of  the  goal  state:  After  the  first  round,  the 
actual  destination  may  be  selected  by  the  adversary,  or,  as  with  the  costs,  some 
combination  of  randomness  and  adversarial  choice  may  select  the  actual  destination. 

•  Partial observability: Information sets can be used to control what the adversary
knows. For example, the adversary may be given complete knowledge of the planner’s
intermediate position (each 2nd level node in the game tree is in its own information
set), partial knowledge (the 2nd level nodes are partitioned into some
number of information sets), or no knowledge whatsoever (all 2nd level states are
in the same information set). Similarly, we can model the planner having only incomplete
knowledge of the goal state: she would then have to guess the goal state,
and upon arriving at that state execute a “stop-here-because-I-think-it-is-the-goal”
action; the reward received would depend on whether or not the state chosen was
actually the goal.

We model the stop-and-wait action by adding a terminal (absorbing) goal state to the MDP that can only be reached by
taking the stop-and-wait action. Any proper policy for this MDP induces a probability distribution on the
state where the stop-and-wait action is taken; since the stop-and-wait action must be taken exactly once, this
distribution can be read directly from the (state, action)-visitation frequency vector for the policy. Thus, we
can use this distribution for the transition probabilities in the overall CEFG.

Generalizations to multiple rounds where information about the final destination is revealed
incrementally (say, by refining some subset in which the destination actually lies)
can easily be constructed. The initial node could also be chosen by the adversary or randomly.
The adversary might or might not have full knowledge of the planner’s initial
location.  Even  if  the  destination  is  fixed  in  advance,  a  multi-stage  version  of  this  problem 
could  be  interesting  in  that  it  allows  the  planner  to  decide  to  stop  and  wait,  hoping  for 
a  more  favorable  cost  function  on  the  next  round.  For  example,  if  the  planner  reaches  a 
door  that  is  closed  (modeled  as  a  very  high  cost),  she  might  decide  to  stop-and-wait  until 
the  next  round  to  see  if  the  door  opens,  rather  than  taking  a  long  way  around.  It  is  also 
possible  that  the  adversary’s  choice  at  the  2nd  level  of  the  game  tree  determines  not  only 
the  goal  state,  but  also  the  dynamics  of  the  MDP  for  the  3rd  level;  this  can  be  modeled 
as  long  as  the  planning  player  observes  which  dynamics  model  is  active,  as  the  dynamics 
model  determines  the  action  set  Xu  available  to  her. 

There is a rich tradition in operations research of using both stochastic and adversarial
models to handle uncertainty. Purely stochastic models of uncertainty in two-stage
and multi-stage problems have received the most attention [Ravi and Sinha, 2004, Gupta
et al., 2004, Immorlica et al., 2004], but purely adversarial models have also been considered
[Dhamdhere et al., 2005, Bailey et al., 2006]. The CEFG framework can bridge the
gap between purely adversarial and purely stochastic formulations, as this example
path-planning domain demonstrates. However, the problems considered in operations research
and stochastic optimization are typically NP-hard, and hence cannot have polynomial
representations as CEFGs. Extending the CEFG framework to include mixed-integer
programming models is an exciting avenue for future work. It should also be possible to extend
our framework to NP-hard problems by modifying the algorithms of the next chapter to
use approximation algorithms for the best response oracles.


4.5  Conclusions 

In  this  chapter  we  introduced  convex  extensive-form  games,  showed  how  to  transform 
CEFGs  into  convex  games,  and  presented  several  examples  demonstrating  the  modeling 
power  offered  by  the  CEFG  class.  Chapter  3  discussed  several  other  interesting  problems 
that  can  be  modeled  as  convex  games.  While  the  results  of  that  chapter  showed  that  convex 
games  can  be  solved  in  polynomial  time,  in  the  next  chapter  we  turn  our  attention  to 
constructing  algorithms  that  are  much  faster  in  practice  than  the  direct  linear  programming 
approach. 

Chapter 5

Fast  Algorithms  for  Convex  Games 


In this chapter we introduce a family of practical algorithms for solving convex games,
with a particular focus on the application of the algorithms to MDPs with adversarial costs
and extensive-form games. For a review of convex games, refer back to Chapter 3. We begin
with a discussion of best response algorithms for particular convex games, and present
the well-known fictitious play algorithm that can exploit such oracles. Section 5.2 introduces
a special-purpose algorithm for the problem of planning in an MDP where an adversary
selects the cost vector from a small finite set of possibilities. Section 5.3 then presents
our general convex game algorithm, beginning with an intuitively straightforward version
and then proceeding to our full algorithm, which addresses some deficiencies of the simplified
version. Finally, in Section 5.5.2 we present experiments on both adversarial-cost
MDPs and on EFG representations of Rhode Island Hold’em poker. Our results demonstrate
dramatic improvements over commercial linear programming software.


5.1  Best  Responses  and  Fictitious  Play 

A central feature of the algorithms we present in this chapter is that they leverage fast
best-response oracles. Consider the convex game $G = (X, Y, M)$, and suppose one player
(say, player y) fixes a strategy $y \in Y$. Then, letting $c = My$ (think of $c$ as a cost vector),
the best-response problem is to compute:

$$\min_{x \in X} \; c \cdot x. \tag{5.1}$$

If  X  is  a  polyhedron,  then  this  is  just  a  standard  linear  program.  But,  in  many  cases, 
much  faster  algorithms  are  available  for  solving  Equation  (5.1).  In  the  case  of  cost-paired 
MDP games, solving Equation (5.1) is exactly the problem of planning in an MDP with
known costs; this problem can be solved efficiently by any number of algorithms, for
example value iteration or even A* in the special (but practically very useful) case of
positive costs and deterministic transitions. For optimal oblivious routing, Equation (5.1)
corresponds  to  solving  a  multi-commodity  flow  problem  for  one  of  the  players.  And  in  the 
case  of  extensive-form  games,  finding  a  best-response  policy  is  accomplished  efficiently 
via  a  special  dynamic  program,  as  discussed  in  Section  3.2.  In  all  of  these  cases,  the 
special-purpose  algorithms  are  likely  to  perform  much  better  than  applying  generic  linear 
programming  techniques. 
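
To make the idea of a special-purpose oracle concrete, here is a minimal sketch for the deterministic, positive-cost path planning case. It uses Dijkstra's algorithm (A* with a zero heuristic) rather than A* itself, and the function name `best_response_path`, the edge-list encoding, and the toy graph are assumptions made for illustration, not part of the text above. The oracle takes a cost vector and returns the best-response strategy as a 0/1 edge-visitation vector.

```python
import heapq

def best_response_path(edges, costs, start, goal):
    """Best response to a cost vector on a graph with deterministic moves.

    edges[i] = (tail, head); costs[i] > 0 is the cost of edge i.
    Returns (value, f) where f[i] = 1.0 if edge i lies on a cheapest path.
    Assumes the goal is reachable from the start.
    """
    adjacency = {}
    for index, (tail, head) in enumerate(edges):
        adjacency.setdefault(tail, []).append((head, index))
    dist = {start: 0.0}
    came_by = {}                      # node -> index of the edge used to reach it
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist[node]:
            continue                  # stale queue entry
        for head, index in adjacency.get(node, []):
            candidate = d + costs[index]
            if candidate < dist.get(head, float("inf")):
                dist[head] = candidate
                came_by[head] = index
                heapq.heappush(heap, (candidate, head))
    # Walk back from the goal to recover the visitation vector.
    f = [0.0] * len(edges)
    node = goal
    while node != start:
        index = came_by[node]
        f[index] = 1.0
        node = edges[index][0]
    return dist[goal], f

# Hypothetical toy graph: two-hop route s -> a -> g versus a direct edge s -> g.
edges = [("s", "a"), ("a", "g"), ("s", "g")]
costs = [1.0, 1.0, 3.0]
value, f = best_response_path(edges, costs, "s", "g")
```

The returned vector `f` plays the role of the strategy $x$ in Equation (5.1), and `value` equals $c \cdot x$ for that strategy.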

Recall that $X \subseteq \mathbb{R}^n$ and $Y \subseteq \mathbb{R}^m$. For the remainder of this chapter, we assume we
have efficient algorithms (best-response oracles) $BR_x : \mathbb{R}^n \to X$ and $BR_y : \mathbb{R}^m \to Y$
for solving Equation (5.1). We view these oracles as functions from cost vectors (rather
than opponent strategies) to strategies, so $x = BR_x(My)$ is a best response for x to the
strategy $y$, and similarly $y = BR_y(x^T M)$ gives a best response for y to $x$. The matrix-vector
multiplications with $M$ are often a dominating computational cost, and so explicitly
tracking such multiplications is important; however, to avoid clutter in our pseudo-code we
hide the multiplications with $M$, for example writing $x = BR_x(y)$. Further, in the cases
just described, the best-response algorithms are better thought of as functions of some
suitably chosen cost or reward vector and do not depend in any way on the properties of
$M$.

It is natural to look for algorithms for solving the overall game that can exploit these
special-purpose best-response oracles. One simple, well-studied algorithm that accomplishes
this is fictitious play: the algorithm simulates two players repeatedly playing the
convex game $G$. Each time $G$ is played, each player chooses to play a best response to
the average of all her opponent’s previous plays.^ While no guarantees can be made about
the performance of each of these players in the simulation, the average over their past
plays eventually converges to a minimax equilibrium. For a recent treatment of fictitious
play, see [Leslie and Collins, 2006]. Pseudo-code for this simple algorithm is given in
Figure (5.1). The average of x’s plays is $x^{cntr}$, and on each iteration this average is updated
by taking a step towards $x^{srch} = BR_x(M y^{cntr})$. Each call to a best-response oracle generates
an upper or lower bound for the minimax value $v^*$ of $G$: if x (the min player) plays
$x^{cntr}$ then the max player y can do no better than playing $y = BR_y((x^{cntr})^T M)$, and so
we conclude $v^* \le V(x^{cntr}, y)$. A similar argument holds for calls to $BR_x$. The sequence
of bounds corresponding to the iterates $(x^{cntr}_t, y^{cntr}_t)$ need not improve monotonically, so in line (1)
we use max and min to guarantee a monotonic sequence. An implementation can then

^Because the sets $X$ and $Y$ are convex, this average is also a valid strategy, and hence we can compute a
best response to it.

    x_0^cntr ← any strategy in X
    y_0^cntr ← any strategy in Y
    lb ← -∞;  ub ← ∞
    t ← 0
    while ((ub - lb) > ε)
        t ← t + 1
        x_t^srch ← BR_x(y_{t-1}^cntr);  y_t^srch ← BR_y(x_{t-1}^cntr)
        V_x = V(x_{t-1}^cntr, y_t^srch);  V_y = V(x_t^srch, y_{t-1}^cntr)
        lb ← max(lb, V_y);  ub ← min(ub, V_x)                                (1)
        x_t^cntr ← (t/(t+1)) x_{t-1}^cntr + (1/(t+1)) x_t^srch
        y_t^cntr ← (t/(t+1)) y_{t-1}^cntr + (1/(t+1)) y_t^srch
    end
    return (x^cntr, y^cntr) corresponding to ub and lb, respectively


Figure 5.1: The fictitious play algorithm.


track  the  corresponding  argmax  and  argmin  strategies,  and  return  these  if  the  algorithm 
is  interrupted  and  asked  to  produce  a  solution  in  an  anytime  fashion;  this  pair  of  strategies 
forms a (ub - lb)-approximate minimax equilibrium. This anytime ability to produce a
pair  of  strategies  that  form  an  e-approximate  minimax  equilibrium  is  very  attractive,  as  it 
allows  us  to  trade  solution  quality  against  computation  time. 
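
The following is a minimal sketch of fictitious play with anytime bounds, specialized to the simplest convex game: a matrix game in which $X$ and $Y$ are probability simplices, so each best-response oracle just picks the best pure strategy. The function name `fictitious_play`, the tie-breaking rule, and the matching-pennies payoff matrix are assumptions for illustration; the thesis's oracles would be replaced by planners for the game at hand.

```python
def fictitious_play(M, eps=0.05, max_iters=5000):
    """Approximate min_x max_y x^T M y where x and y range over simplices.

    Returns (lb, ub, best_x, best_y): bounds on the minimax value and the
    averaged strategies that achieved the upper and lower bound.
    """
    n, m = len(M), len(M[0])
    x_avg = [1.0] + [0.0] * (n - 1)          # arbitrary initial strategies
    y_avg = [1.0] + [0.0] * (m - 1)
    lb, ub = float("-inf"), float("inf")
    best_x, best_y = list(x_avg), list(y_avg)
    for t in range(1, max_iters + 1):
        cost = [sum(M[i][j] * y_avg[j] for j in range(m)) for i in range(n)]    # M y
        reward = [sum(x_avg[i] * M[i][j] for i in range(n)) for j in range(m)]  # x^T M
        i_br = min(range(n), key=cost.__getitem__)    # best response for the min player
        j_br = max(range(m), key=reward.__getitem__)  # best response for the max player
        if reward[j_br] < ub:                # V(x_avg, BR_y) bounds the value from above
            ub, best_x = reward[j_br], list(x_avg)
        if cost[i_br] > lb:                  # V(BR_x, y_avg) bounds the value from below
            lb, best_y = cost[i_br], list(y_avg)
        if ub - lb <= eps:
            break
        # Step each running average toward the new best response.
        x_avg = [xi * t / (t + 1) for xi in x_avg]
        x_avg[i_br] += 1.0 / (t + 1)
        y_avg = [yj * t / (t + 1) for yj in y_avg]
        y_avg[j_br] += 1.0 / (t + 1)
    return lb, ub, best_x, best_y

# Matching pennies (x pays y): the minimax value is 0.
lb, ub, best_x, best_y = fictitious_play([[1.0, -1.0], [-1.0, 1.0]])
```

Interrupting the loop at any point still yields valid bounds, because `lb` and `ub` are only ever updated from best responses to genuine strategies.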

This anytime performance can be particularly important when considering very large
games where abstractions (approximations) must be introduced to make any solution possible.
For example, there has been much recent work on abstraction for extensive-form
games, and poker in particular [Billings et al., 2003, Gilpin and Sandholm, 2006a]. In
such applications, approximately solving a larger (less abstracted) game may be preferable
to exactly solving a more heavily abstracted version.

There is a close connection between fictitious play (especially smooth versions of fictitious
play) and running a pair of no-regret algorithms in self-play, one for each player. For
example, the algorithms of Kalai and Vempala [2003] and Gordon [2005] can be used in
self-play in the same general form as the algorithm of Figure (5.1); the best-response oracle is replaced
with a special-purpose oracle that, intuitively, introduces additional smoothing such that
the agent randomizes among strategies that have similarly good performance. The regret
bounds for such algorithms immediately give both convergence-rate guarantees and
performance guarantees for the agents in the simulation.

We  now  investigate  another  method  that  utilizes  a  best-response  oracle;  however,  in 
this  case  the  oracle  is  used  to  generate  separating  hyperplanes  (cutting  planes)  in  a  linear 
programming  approach.  In  the  next  section  we  concentrate  on  the  particular  case  of  an 
MDP  with  adversarial  costs;  we  introduce  more  general  algorithms  for  convex  games  in 
the  following  section. 


5.2  The  Single-Oracle  Algorithm  for  MDPs  with 
Adversarial  Costs 

In  this  section  we  develop  an  efficient  algorithm  to  solve  MDP  planning  problems  where 
an  adversary  selects  the  cost  vector  from  a  finite  set  K  of  possible  costs.  This  problem  was 
introduced in Section 3.4. In particular, we use Benders’ decomposition [Benders, 1962]
to capitalize on the existence of best-response oracles like A* search and value iteration.
The  double  oracle  algorithms  introduced  later  generalize  this  technique  to  the  case  where 
a  best-response  oracle  is  also  available  for  the  adversary. 

Recall the problem formulation from Section 3.4.2: We have an MDP $\mathcal{M}$ with known
dynamics and a fixed start-state distribution $\mu_s$, and a set $K = \{c_1, \ldots, c_k\}$ of cost vectors.
Simultaneously, player x selects a policy $\pi$ for $\mathcal{M}$ and player y selects a cost vector $c \in K$.
Then, player x pays y the amount $V(\pi, c)$, the expected cost of following policy $\pi$ from a
state sampled from $\mu_s$ under cost vector $c \in K$. The dynamics of the MDP are captured
by the matrix $E$. For a fixed start-state distribution $\mu_s$, the set of stochastic policies for
player x can be represented as the set of valid state-action visitation frequencies,

$$F = \{ f \mid E^T f + \mu_s = 0,\ f \ge 0 \}.$$

Thus, the game can be formulated as the convex game $(F, \Delta(K), M)$, where $M$ is the
matrix with columns $c_1, \ldots, c_k$.

Our  iterative  algorithm  is  an  application  of  Benders’  decomposition,  a  general  method 
for  decomposing  certain  linear  programs  first  studied  by  Benders  [1962].  We  focus  on  the 
application  of  this  technique  to  the  problem  at  hand,  and  refer  the  reader  to  Bazaraa  et  al. 
[1990]  for  a  more  general  introduction.  Benders’  decomposition  is  dual  to  the  Dantzig- 
Wolfe  decomposition,  and  can  also  be  viewed  as  a  specialization  of  the  Kelley  cutting 
plane  method  to  linear  programs  [Hiriart-Urruty  and  Lemarechal,  1993]. 

The set

$$H(K) = \{ c \mid c = Mq,\ q \in \Delta(K) \}$$

is the set of all (expected) cost vectors the adversary can potentially achieve by playing
implicit mixed strategies from $\Delta(K)$. Our algorithm is applicable when we have an oracle
$BR_x : H(K) \to F$ that for any cost vector $c \in H(K)$ provides a best-response policy $\pi$.
Our algorithm requires that $\pi$ be represented by its state-action frequency vector $f_\pi$. If
the oracle algorithm actually used provides a policy represented as a value function (for
example, if we implement the oracle using value iteration) we can calculate $f_\pi$ with a
matrix inversion or by iterative methods.
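
As an illustration of the iterative route, here is a sketch that computes a state-action visitation frequency vector by propagating probability mass forward until it is absorbed. The dictionary encoding of the policy and dynamics, the function name `visitation_frequencies`, the fixed iteration count, and the three-state example are all assumptions made for this sketch; it presumes a proper policy, so that all mass eventually reaches an absorbing goal state.

```python
def visitation_frequencies(policy, transitions, start, iters=200):
    """Return f[(s, a)], the expected number of times a is taken in s.

    policy[s][a]        -> probability of choosing a in non-terminal state s
    transitions[(s, a)] -> dict mapping successor state to probability
    start[s]            -> start-state probability
    States that do not appear in `policy` are treated as absorbing goals.
    """
    occupancy = dict.fromkeys(policy, 0.0)            # expected visits per state
    mass = {s: start.get(s, 0.0) for s in policy}     # probability arriving this step
    for _ in range(iters):
        arriving = dict.fromkeys(policy, 0.0)
        for s, m in mass.items():
            occupancy[s] += m
            for a, prob_a in policy[s].items():
                for s_next, p in transitions[(s, a)].items():
                    if s_next in arriving:            # mass entering a goal leaves the system
                        arriving[s_next] += m * prob_a * p
        mass = arriving
    return {(s, a): occupancy[s] * prob_a
            for s in policy for a, prob_a in policy[s].items()}

# Hypothetical example: from state 0, "go" reaches state 1 half the time and
# otherwise stays put, while "jump" goes straight to the goal.
policy = {0: {"go": 0.5, "jump": 0.5}, 1: {"go": 1.0}}
transitions = {
    (0, "go"): {0: 0.5, 1: 0.5},
    (0, "jump"): {"goal": 1.0},
    (1, "go"): {"goal": 1.0},
}
f = visitation_frequencies(policy, transitions, start={0: 1.0})
```

In this example the expected number of visits to state 0 solves $d_0 = 1 + 0.25\, d_0$, so $d_0 = 4/3$, and the total flow into the goal, `f[(0, "jump")] + f[(1, "go")]`, is 1, as flow conservation requires.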

We quickly restate the relevant linear programming results from Section 3.4.3. The
value of the MDP for a fixed cost $c$ (typically, $c = Mq$ for the game) can be found by
solving the linear program

$$\max_{v} \; v \cdot \mu_s \quad \text{subject to} \quad Ev + c \ge 0, \tag{5.2}$$

or via the dual,

$$\min_{f} \; f \cdot c \quad \text{subject to} \quad E^T f + \mu_s = 0, \quad f \ge 0. \tag{5.3}$$

We can solve the adversarial MDP convex game via the linear program

$$\max_{v,q} \; v \cdot \mu_s \quad \text{subject to} \quad Ev + Mq \ge 0, \quad \mathbf{1} \cdot q = 1, \quad q \ge 0, \tag{5.4}$$

or via its dual,

$$\min_{f,z} \; -z \quad \text{subject to} \quad E^T f + \mu_s = 0, \quad \mathbf{1} z + M^T f \le 0, \quad f \ge 0. \tag{5.5}$$


Figure 5.2: The piecewise linear concave function $V$ (dotted line) and an approximation
$V_B$ based on the bundle $B_x = \{f_1, f_2, f_3\}$ (the minimum of the three thin black lines). The
maximum with respect to the approximation $V_B$ is at $q_1 = 0.7$.


Let $V(q)$ be the optimal value of (5.2) for a fixed cost vector $c = Mq$ for $q \in \Delta(K)$;
we can evaluate $V$ using the best-response oracle $BR_x$. Then, we can rewrite (5.4) as the
program

$$\max_{q \in \Delta(K)} \left[ \max_{v} \; v \cdot \mu_s \quad \text{subject to} \quad Ev + Mq \ge 0 \right] \;=\; \max_{q \in \Delta(K)} V(q). \tag{5.6}$$


We will work with the right-hand side of this representation. Unfortunately, $V(q)$ is not
linear so we cannot solve the program directly as a linear program over $\Delta(K)$. However,
it can be solved via a convergent sequence of approximations that capitalize on the availability
of our oracle $BR_x$. Using strong duality for linear programming and Equation (5.3),
we can rewrite $V$ as

$$V(q) = \min_{f \in F} \; (Mq) \cdot f.$$

Since $V$ is the minimum over a polyhedral set of linear functions, it is piecewise linear and
concave [see Boyd and Vandenberghe, 2004, for example]. Let $B_x \subseteq F$ be a finite subset
of $F$. Then,

$$V_B(q) = \min_{f \in B_x} \; (Mq) \cdot f$$

is a piecewise linear and concave upper bound on $V$. If $Cn(F) \subseteq B_x$ then it can be shown
that $V_B = V$. Figure (5.2) shows $V$ and $V_B$ for an example with $|K| = 2$, and so $\Delta(K)$
effectively has a single dimension. The dotted line indicates $V$; it is defined in this case by
7 linear segments. The approximation $V_B$, based on the set of strategies $B_x = \{f_1, f_2, f_3\}$,
gives an upper bound; the rest of the figure will be explained shortly.

Our algorithm performs two steps for each iteration. On iteration $i$, first we solve for
an optimal mixture of costs $q_i$ under the assumption that the planner is only allowed to
select a policy^ from a restricted set $B_x = \{\pi_1, \pi_2, \ldots, \pi_i\} \subseteq F$. Then, we use the oracle
to compute $BR_x(M q_i) = \pi_{i+1}$, an optimal deterministic policy with respect to the fixed
cost vector $c = M q_i$. The policy $\pi_{i+1}$ is added to $B_x$, and these steps are iterated until an
iteration where $\pi_{i+1}$ is already in $B_x$.

All that remains is to show how to find the optimal cost mixture $q_i$ given that the planner
will select a policy from the set $B_x = \{f_1, f_2, \ldots, f_i\}$ of feasible solutions to (5.3). That
is, we wish to solve Equation (5.6) with $V$ replaced by $V_B$. This problem can be solved
with the linear program

$$\max_{q \in \Delta(K),\, v} \; v \quad \text{subject to} \quad v \le f_j \cdot Mq \quad \text{for } 1 \le j \le i, \tag{5.7}$$

which is essentially the same program as (5.5), where $f$ is restricted to be a member of
$B_x$ rather than an arbitrary stochastic policy. The key difference is that $f_j^T M$ is a constant
vector for each $f_j \in B_x$, and so the size of the linear program is independent of the size of
the MDP $\mathcal{M}$ (there is no dependence on $|S|$ or $|A|$, as there is in Equations (5.5) and (5.4)).
We interpret this program as solving the matrix game with one column for each $c \in K$, and
one row for each $f \in B_x$. This program is known as the master program of the Benders’
decomposition. Equation (5.3) is the slave program, which in our case is solved not as a
linear program, but using the fast best-response oracle. In Figure (5.2), solving the master
program (5.7) for the set $B_x = \{f_1, f_2, f_3\}$ produces the minimax solution $q_1 = 0.7$ for the
cost-selecting player (and so $q_2 = 0.3$). The algorithm then computes the corresponding
cost vector, $c = 0.7 c_1 + 0.3 c_2$, and then calls $BR_x(c)$, which returns the best-response
strategy $f_4$. This strategy is then added to $B_x$ and the process continues. Pseudo-code for
the complete algorithm is given in Figure (5.3).

Since $V_B$ gives an upper bound on $V$, the solution to the master program (5.7) gives an
upper bound on the value of the game. Since we only ever tighten the approximation of $V$,
the sequence of upper bounds generated by the algorithm is non-increasing. As mentioned

^We write $\pi_i$ and $f_i$ equivalently, depending on which interpretation we wish to emphasize: $\pi_i$ is a
stochastic policy, and $f_i$ is the corresponding state-action visitation frequency vector.




    B_x ← {}
    q_0 ← arbitrary q ∈ Δ(K)
    lb ← -∞;  ub ← ∞
    t ← 1
    while ((ub - lb) > ε)
        f_t = BR_x(M q_{t-1})
        lb ← max(lb, V(f_t, q_{t-1}))
        (q_t, v) ← solution to the LP of Equation (5.7)
        ub ← v        // v improving monotonically
        t ← t + 1
    end
    return best (f, q)


Figure 5.3: The single oracle algorithm.

before, each call to $BR_x$ generates a lower bound on the value of the game, but these need
not be monotonically increasing, and so we use the max operator.
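
The loop described above can be sketched end to end for the special case $|K| = 2$, where the master program is one-dimensional, as in Figure (5.2), and can be solved exactly by checking the endpoints and the pairwise intersections of the bundle's lines instead of calling an LP solver. Everything concrete here is an assumption for illustration: the three-route toy instance, the brute-force `best_response` standing in for a fast planner, and the helper names.

```python
from itertools import combinations

# Hypothetical toy instance: three parallel routes, each a visitation vector f.
POLICIES = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
COSTS = [[1.0, 5.0, 2.0], [5.0, 1.0, 3.0]]      # cost vectors c_1 and c_2

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mix_cost(q1):
    """Expected cost vector M q for the mixture q = (q1, 1 - q1)."""
    return [q1 * a + (1.0 - q1) * b for a, b in zip(COSTS[0], COSTS[1])]

def best_response(c):
    """Brute-force stand-in for a fast planner such as A* or value iteration."""
    return min(POLICIES, key=lambda f: dot(f, c))

def solve_master(bundle):
    """Maximize over q1 in [0, 1] the minimum of the lines q1*a_j + (1 - q1)*b_j."""
    lines = [(dot(f, COSTS[0]), dot(f, COSTS[1])) for f in bundle]
    candidates = [0.0, 1.0]
    for (a1, b1), (a2, b2) in combinations(lines, 2):
        denom = (a1 - b1) - (a2 - b2)
        if abs(denom) > 1e-12:                  # skip parallel lines
            q = (b2 - b1) / denom
            if 0.0 <= q <= 1.0:
                candidates.append(q)

    def v_bundle(q):
        return min(q * a + (1.0 - q) * b for a, b in lines)

    q_best = max(candidates, key=v_bundle)
    return q_best, v_bundle(q_best)

def single_oracle(q1=1.0, eps=1e-9, max_iters=100):
    bundle, lb, ub, best_q = [], float("-inf"), float("inf"), q1
    for _ in range(max_iters):
        c = mix_cost(q1)
        f = best_response(c)                    # slave problem, solved by the oracle
        if dot(f, c) > lb:                      # V(q) is a lower bound on the game value
            lb, best_q = dot(f, c), q1
        if ub - lb <= eps:
            break
        if f not in bundle:
            bundle.append(f)
        q1, ub = solve_master(bundle)           # master problem gives an upper bound
    return best_q, lb, ub

best_q, lb, ub = single_oracle()
```

On this instance the bounds meet after four oracle calls at a game value of 2.6 with $q_1 = 0.4$. The upper bounds produced along the way are 5, 3, and 2.6, which decrease as the bundle grows.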

The ε-approximate minimax cost mixture is given by the $q$ corresponding to the best
lower bound. The approximately optimal policy for the planning player can be expressed
as a distribution over the policies in $B_x$. These values are given by the dual variables (say,
$p$) of (5.7), and can thus be found via matrix inversion or may be immediately available
depending on the linear programming technique used to solve the program. Given the dual
variables $p$, we compute the best $f$ as the stochastic policy

$$f = \sum_{i=1}^{t} p_i f_i.$$

The convergence and correctness of this algorithm are immediate from the corresponding
proofs for Benders’ decomposition; in the worst case all of the strategies in $Cn(F)$ will
be added to $B_x$, ensuring finite though possibly exponential runtime. In Section 5.5.1, we
demonstrate experimentally that the sequence of bounds converges quickly in practice.
We refer to this algorithm as the single oracle algorithm because it relies only on a
best-response oracle for the row player. While we have stated the algorithm of this section in
terms of the particular convex game $(F, \Delta(K), M)$ for MDPs with adversarial costs, this
technique can be applied any time one of the players in the convex game has strategies
given via a relatively small explicitly enumerated set like $K$. The constraint-generation
approach used here can also be applied to solving the linear programs of Equation (5.5)
or Equation (5.4) via the ellipsoid or analytic centering algorithms [Bazaraa et al., 1990,
Hiriart-Urruty and Lemarechal, 1993].


Motivation  for  the  Double  Oracle  Algorithm 

The single oracle algorithm is sufficient for problems when the set $K$ is reasonably small;
in these cases solving the master problem, Equation (5.7), is fast. For example, we use this
approach in our path planning problem if the opponent is confined to a small number of
possible sensor locations and we know that he will place only a single sensor. However,
suppose there are a relatively large number of possible sensor locations (say 50 or 100),
and that the adversary will actually place 2 sensors. If the induced cost function assigns
an added cost to all locations visible by one or more of the sensors, then we cannot decouple
the choice of locations, and so there will be $\binom{n}{2}$ possible cost vectors in $K$, where $n$ is
the number of possible sensor locations. The
single oracle algorithm is not practical for a problem with this many cost vectors; simply
representing them all in memory would be prohibitive.
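
For a sense of scale, the count of sensor placements for the figures mentioned above can be computed directly. The helper name `num_cost_vectors` is an assumption; note that each of these cost vectors has one entry per (state, action) pair of the MDP, so the storage needed is this count multiplied by the size of the MDP.

```python
import math

def num_cost_vectors(num_locations, num_sensors=2):
    """Number of distinct ways to place the sensors, one cost vector per placement."""
    return math.comb(num_locations, num_sensors)

counts = {n: num_cost_vectors(n) for n in (50, 100)}
```

Here `counts` is `{50: 1225, 100: 4950}`, and the number grows quadratically in the number of locations when two sensors are placed.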

We now derive a generalization of the single oracle algorithm that can take advantage
of best-response oracles for both players. We present this algorithm as it applies to
arbitrary  convex  games. 


5.3  A  Bundle-based  Double  Oracle  Algorithm 

In this section, we introduce two algorithms that take advantage of a best-response oracle
for both players. The basic double oracle bundle algorithm (first introduced in [McMahan
et al., 2003]) is described first; we then extend basic DOBA with line search, aggregation,
and a method for interpolating with fictitious play. We call this extended algorithm
DOBA+.

The basic DOBA builds up a collection of strategies (called a bundle) for each player.
On each iteration it solves an approximate game where each player is only allowed to
randomize among the strategies contained in his or her bundle. Given the optimal strategies
in this restricted game, it calls the oracles to find best responses for each player in
the full game, and then adds these responses to the bundles to improve the approximation.
The double oracle bundle method is related to the family of cutting plane and bundle
algorithms for non-smooth optimization, and to Benders’ decomposition in the case
of polyhedra [Hiriart-Urruty and Lemarechal, 1993]. However, the direct application of
those techniques to convex games yields algorithms that only take advantage of the
best-response oracle for one of the players, not both. There is great potential for future work in
adapting the rich set of techniques from that literature to the particular problem of solving
convex games; the work presented here is but a first step in that direction.

    B_x^0 ← {x_0};  B_y^0 ← {y_0}
    M_{x_0,y_0} ← V(x_0, y_0)
    lb ← -∞;  ub ← ∞
    t ← 0
    while ((ub - lb) > ε)
        (p, q) ← solveMatrixGame(M)
        x_t^mix ← Σ_{x ∈ B_x^t} p_x x;  y_t^mix ← Σ_{y ∈ B_y^t} q_y y
        x_{t+1} ← BR_x(y_t^mix);  y_{t+1} ← BR_y(x_t^mix)
        V_x = V(x_t^mix, y_{t+1});  V_y = V(x_{t+1}, y_t^mix)
        lb ← max(lb, V_y);  ub ← min(ub, V_x)
        B_x^{t+1} ← B_x^t ∪ {x_{t+1}};  B_y^{t+1} ← B_y^t ∪ {y_{t+1}}
        (∀y' ∈ B_y^{t+1})  M_{x_{t+1},y'} ← V(x_{t+1}, y')
        (∀x' ∈ B_x^{t+1})  M_{x',y_{t+1}} ← V(x', y_{t+1})
        t ← t + 1
    end
    return best (x^mix, y^mix)


Figure 5.4: The basic double oracle bundle algorithm.


5.3.1  The  Basic  Algorithm 

In this section we describe the basic double oracle algorithm. This algorithm can require
an amount of memory exponential in the problem size. While it still manages very good
performance on adversarial cost MDP problems, it is of only theoretical interest for
solving large extensive-form games like Rhode Island Hold’em; the full DOBA+ algorithm,
introduced next, builds on the basic double oracle algorithm by providing a way to limit
memory consumption. Pseudo-code for the basic algorithm is given in Figure (5.4).

The principal intuition of the double oracle bundle algorithms is to use our
best-response oracles to build up an approximate version of the full convex game. Fix
$G = (X, Y, M)$ as the convex game we wish to solve. Our approximate game $\tilde G$ will also be
convex, given by $(\tilde X, \tilde Y, M)$, where $\tilde X \subseteq X$ and $\tilde Y \subseteq Y$. The set $\tilde X$ will be constructed
from a set of best responses for x to various player y strategies, and similarly for $\tilde Y$. It
should be clear that, if $\tilde X$ approaches X and $\tilde Y$ approaches Y, the approximate game
becomes more and more similar to the exact one. Of course, we hope that the approximation
becomes good before $\tilde X$ and $\tilde Y$ become intractably large. The key difference between the
basic algorithm and DOBA+ is that the latter explicitly manages the complexity of $\tilde X$ and
$\tilde Y$ while still guaranteeing convergence to a solution to the overall game G.

In both the basic algorithm and DOBA+, we maintain a finite bundle of strategies for
each player, $B_x \subseteq X$ and $B_y \subseteq Y$ respectively. The convex hull of $B_x$, written $H(B_x)$, is a
convex subset of X, and similarly $H(B_y)$ is a convex subset of Y. Thus, we can define the
convex game $\tilde G = (H(B_x), H(B_y), M)$ which we will use as a model of G.

We can define an equivalent matrix game $\tilde M$ which has strategy sets $B_x$ for player x
and $B_y$ for player y, with payoffs $\tilde M(x, y) = V(x, y)$. The convex game $\tilde G$ is equivalent
to $\tilde M$, in the sense that strategies in $\tilde G$ and $\tilde M$ can be translated back and forth without
altering expected payoffs. More precisely, the bi-linearity of V means that if (p, q) is a
solution to $\tilde M$, then $x^{mix} = \sum_{x \in B_x} p(x)\, x$ and $y^{mix} = \sum_{y \in B_y} q(y)\, y$ form a solution to $\tilde G$.
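
The translation between the matrix game and the convex-hull game can be checked numerically. The sketch below is only an illustration: the 3 x 3 cost matrix, the two-element bundles, and the weights p and q are hypothetical (p and q are arbitrary mixtures here, not a solution of the matrix game), and the helper names are not from the thesis. It builds $\tilde M$ from the bundles and confirms that a mixture over bundle elements earns the same payoff in both representations.

```python
def value(x, M, y):
    """Bilinear payoff V(x, y) = x^T M y."""
    return sum(x[i] * M[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))

def mixture(weights, strategies):
    """Convex combination sum_k weights[k] * strategies[k]."""
    dim = len(strategies[0])
    return [sum(w * s[i] for w, s in zip(weights, strategies)) for i in range(dim)]

# Hypothetical full game: rock-paper-scissors costs for the row player x.
M = [[0, 1, -1],
     [-1, 0, 1],
     [1, -1, 0]]

bundle_x = [[1, 0, 0], [0, 0.5, 0.5]]        # B_x
bundle_y = [[0, 1, 0], [0.25, 0.25, 0.5]]    # B_y

# Model matrix game: one row per element of B_x, one column per element of B_y.
M_tilde = [[value(x, M, y) for y in bundle_y] for x in bundle_x]

p = [0.25, 0.75]   # mixture over B_x
q = [0.5, 0.5]     # mixture over B_y
x_mix = mixture(p, bundle_x)
y_mix = mixture(q, bundle_y)

payoff_in_matrix_game = value(p, M_tilde, q)
payoff_in_full_game = value(x_mix, M, y_mix)
print(payoff_in_matrix_game, payoff_in_full_game)
```

Note that $\tilde M$ has only $|B_x| \times |B_y|$ entries, regardless of the dimension of the strategies stored in the bundles.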

We will move back and forth between the two equivalent representations $\tilde G$ and $\tilde M$. For
interpretation we will use $\tilde G$, since its relationship to G is more clear. But for computation
we will work with $\tilde M$, since its size is independent of m and n (it is $|B_x| \times |B_y|$). This last
fact is critical for large games: for example, in Rhode Island Hold’em, m and n are both
approximately $1 \times 10^6$, while we fix $|B_x| = |B_y| = 55$.

Both the basic algorithm and DOBA+ build up the model game $\tilde M$ in an intuitive way
using our best-response oracles. We initialize the bundles with one or more strategies
for each player. These could be arbitrary or randomly chosen strategies, but there is
the opportunity to increase performance by seeding the algorithm with a collection of
expert-generated strategies.

Given the current bundles $B_x$ and $B_y$, we solve the corresponding matrix game $\tilde M$,
producing a mixed strategy (p, q). We then compute the corresponding strategies $x^{mix} \in X$
and $y^{mix} \in Y$; $(x^{mix}, y^{mix})$ is a valid strategy pair in either $\tilde G$ or G. Because $(x^{mix}, y^{mix})$ is
a strategy pair for G, we can use our oracles to generate best responses $BR_y((x^{mix})^T M)$
and $BR_x(M y^{mix})$. We add these new strategies to the bundles, and also use the fact that
they are best responses to update upper and lower bounds on the value of the game G. The
algorithm continues in this fashion until the gap (ub $-$ lb) falls below an acceptably small
level $\epsilon$.
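
The bound updates can be illustrated on a small matrix game, where a best response is simply the best pure strategy. This is a minimal sketch, assuming (as in this chapter) that x minimizes and y maximizes $x^T M y$; the matching-pennies matrix and the two strategy pairs are hypothetical inputs standing in for successive $(x^{mix}, y^{mix})$.

```python
def upper_bound(x, M):
    """V(x, BR_y(x^T M)): the most the maximizing player y can obtain against x."""
    payoffs = [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]
    return max(payoffs)

def lower_bound(y, M):
    """V(BR_x(M y), y): the least the minimizing player x has to pay against y."""
    costs = [sum(M[i][j] * y[j] for j in range(len(y))) for i in range(len(M))]
    return min(costs)

M = [[1, -1],
     [-1, 1]]   # matching pennies, cost to x

ub, lb = float("inf"), float("-inf")
for x_mix, y_mix in [([1, 0], [1, 0]), ([0.5, 0.5], [0.5, 0.5])]:
    ub = min(ub, upper_bound(x_mix, M))
    lb = max(lb, lower_bound(y_mix, M))
    print("ub =", ub, "lb =", lb, "gap =", ub - lb)
```

The pure strategy pair leaves a gap of 2, while the uniform pair closes the gap to 0, certifying that it is a minimax solution of this small game.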


Discussion  On each iteration, the size of each bundle increases by one, as does each
dimension of the matrix game $\tilde M$. This is, in fact, the principal weakness of the basic
algorithm: the cost of each iteration and the size of the bundles grow, making it infeasible
to run an arbitrary number of iterations. In particular, for Rhode Island Hold’em,
storing each strategy in the bundle requires about 7 MB of memory, and so physical memory
rapidly limits the size of the bundles and hence the number of iterations we can run.
To address this issue, DOBA+ uses an aggregation and pruning scheme that allows it to
maintain a constant bundle size.

A second deficiency of the basic double oracle algorithm is that inaccuracies in the
model $\tilde G$ can lead to solutions that in fact perform poorly in the true game G.
However, the direction from the current best pair of strategies $(x^*, y^*)$ towards
$(x^{mix}, y^{mix})$ usually provides a good direction of improvement. To exploit this fact, we
introduce a fast line search procedure that efficiently solves this 1-dimensional
optimization problem. Pseudo-code for the complete DOBA+ algorithm is given in Figure (5.5);
in the following sections we discuss the principal improvements over the basic version.


5.3.2  Aggregation 

Our aggregation and pruning scheme has two components. First, we insert the minimax
strategies $x^{mix}$ and $y^{mix}$ into the bundles. This has no effect on the convex hulls of the
bundles if we never remove strategies, but since we will be discarding strategies, adding
the mixtures is useful: in this way even if we throw out some strategies that support $x^{mix}$,
we may still keep $x^{mix} \in H(B_x)$ by explicitly placing $x^{mix} \in B_x$.

    while ((ub $-$ lb) $> \epsilon$)
        $(p, q) \leftarrow$ solveMatrixGame($\tilde M$)
        update strategy weights
        $x^{mix} \leftarrow \sum_{x \in B_x} p(x)\, x$
        $y^{mix} \leftarrow \sum_{y \in B_y} q(y)\, y$
        updateCenter(x)
        ub $\leftarrow$ min(ub, $V(x^{cntr}, BR_y(x^{cntr}))$)
        updateCenter(y)
        lb $\leftarrow$ max(lb, $V(BR_x(y^{cntr}), y^{cntr})$)
        if ($B_x$ or $B_y$ are too big)
            do aggregation on $B_x$ or $B_y$
        update $\phi$
        $t \leftarrow t + 1$
    end

    updateCenter(x):
        $x^{srch} \leftarrow$ search($BR_x(y^{cntr})$, $x^{mix}$, $[0, 1 - \phi]$)
        fpstep $\leftarrow 1/(t + 1)$
        $\alpha \leftarrow \phi \cdot$ fpstep
        $\beta \leftarrow$ fpstep $+ (1 - \phi)(1 -$ fpstep$)$
        $x^{cntr} \leftarrow$ search($x^{cntr}$, $x^{srch}$, $[\alpha, \beta]$)
        add $\{x^{mix}, BR_x(y^{mix})\}$ to $B_x$    (*)
        add $\{x^{cntr}, BR_x(y^{cntr})\}$ to $B_x$    (*)
    end

Figure 5.5: DOBA+: the double oracle bundle algorithm with line search, aggregation,
and convergence guarantees. Initialization and updates to $\tilde M$ are similar to those in the
basic algorithm.


In order to determine which strategies to discard, each time we solve $\tilde M$ we use the
mixed strategies to update a weight w(x) (or w(y)) associated with each strategy in the
bundle: this weight is a discounted average of the probabilities placed on the strategy by
past solutions to $\tilde M$. Each iteration, we choose to remove the strategies with the smallest
weights; we then add to the bundle an aggregate of the removed strategies, with each
removed strategy weighted proportionally to its weight. That is, if we remove $x_1, \ldots, x_k$
and let $W = \sum_{i=1}^{k} w(x_i)$, then the aggregate strategy is given by

    $x_{aggr} = \frac{1}{W} \sum_{i=1}^{k} w(x_i)\, x_i.$
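
The weight update and the aggregation step can be sketched as follows. The exact form of the discounted average is not specified in the text, so the exponential moving average and the `discount` parameter below are assumptions, as are the function names, the weight assigned to the new aggregate, and the fallback used when all removed weights are zero.

```python
def update_weights(weights, probs, discount=0.9):
    """Discounted average of solution probabilities (assumed exponential form)."""
    return [discount * w + (1 - discount) * p for w, p in zip(weights, probs)]

def aggregate(bundle, weights, num_remove):
    """Drop the num_remove lowest-weight strategies and append their weighted average."""
    order = sorted(range(len(bundle)), key=lambda i: weights[i])
    removed, kept = order[:num_remove], sorted(order[num_remove:])
    W = sum(weights[i] for i in removed)
    dim = len(bundle[0])
    if W > 0:
        x_aggr = [sum(weights[i] * bundle[i][d] for i in removed) / W for d in range(dim)]
    else:
        # All removed weights are zero: fall back to a plain average (assumption).
        x_aggr = [sum(bundle[i][d] for i in removed) / len(removed) for d in range(dim)]
    new_bundle = [bundle[i] for i in kept] + [x_aggr]
    new_weights = [weights[i] for i in kept] + [W]  # aggregate inherits W (assumption)
    return new_bundle, new_weights

bundle = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.0]]
weights = [0.1, 0.3, 0.5, 0.1]
new_bundle, new_weights = aggregate(bundle, weights, num_remove=2)
print(new_bundle[-1])  # aggregate of the two lowest-weight strategies
```

Because the aggregate is a convex combination of the removed strategies, it remains a valid strategy, and the caller would also drop the matching rows or columns of $\tilde M$ and add one for the aggregate.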


To keep the bundle size constant, we remove five strategies on each step: one each
to make room for the four strategies added to the bundle in the lines marked (*) in
Figure (5.5), and one to make room for the aggregated strategy $x_{aggr}$. Of course, when we
remove strategies from the bundles we must remove the corresponding rows and columns
from $\tilde M$ as well.


5.3.3  Line Search


For extensive-form games, it takes time O(m) to run the oracle $BR_x(c_y)$ for a fixed cost
vector $c_y = My$, but the cost of the multiplication to compute $c_y$ is O(nm). While the
matrix M may be sparse, multiplications with M will still typically be slower than
best-response calls by a considerable constant; for example, this constant is around 20 for
Rhode Island Hold’em, and even higher for approximately abstracted versions of Texas
Hold’em. However, the multiplication Mq for adversarial MDPs with only a few possible
cost vectors (small |K|) will be relatively inexpensive, usually much cheaper than solving
the MDP with the fixed cost vector c = Mq. Thus, the applicability of the techniques of
this section will depend on these tradeoffs for the particular convex game at hand.

In this section we show how we can take advantage of the relative speed of computing
best responses for fixed cost vectors. Consider a restricted convex game with
$B_x = \{x_1, x_2\}$ and $\tilde X = H(B_x)$ but $\tilde Y = Y$. That is, x has exactly two strategies, while y
has full access to his strategy set. We show that we can solve the corresponding restricted
game $(H(B_x), Y, M)$ efficiently via a line search.

The key is that x’s choice of a probability distribution over $B_x$ only has a single degree
of freedom. Using $\theta$ to represent this free variable, we can write the problem of solving
this game as:

    $\min_{\theta \in [0,1]} \max_{y \in Y} \left((1 - \theta)x_1 + \theta x_2\right)^T M y$    (5.8)

For simplicity, we write $x(\theta) = (1 - \theta)x_1 + \theta x_2$. Then, define the function
$f : \mathbb{R} \to \mathbb{R}$ by

    $f(\theta) = \max_{y \in Y} x(\theta)^T M y$    (5.9)

and so solving Equation (5.8) is equivalent to solving $\min_{\theta \in [0,1]} f(\theta)$. In fact, f is just a
pointwise maximum over a set of affine functions, one for each $y \in Y$, and so f is convex.
We can minimize such a function via an exact binary line search if we can evaluate f at
all $\theta \in [0, 1]$ and also compute a subgradient to f at each $\theta$. The best-response oracle $BR_y$
can be used to accomplish both these tasks.

For a fixed $\theta$, we can find a y that achieves the maximum in Equation (5.9) by computing
$y = BR_y(x(\theta)^T M)$, so that $f(\theta) = V(x(\theta), y)$. Further, y corresponds to the linear
function $f_y(\theta') = x(\theta')^T M y$ which gives a lower bound on f and is tight at $\theta$, so the
slope of $f_y$ is a subgradient of f at $\theta$. This can easily be calculated as $(x_2 - x_1)^T M y$.

This gives us the necessary components to implement a line search, but each evaluation
and subgradient calculation may require a multiplication with M. We avoid these
multiplications in the following way. Let $c_1 = x_1^T M$ and $c_2 = x_2^T M$, which we can pre-compute
before the line search,* and define $c(\theta) = (1 - \theta)c_1 + \theta c_2$. For a fixed $\theta$, we can evaluate
$c(\theta)$ in O(m) time. After computing $y = BR_y(c(\theta))$, we calculate† $f(\theta)$ as $c(\theta) \cdot y$. Using
the same y, we can then calculate the necessary subgradient as $(c_2 - c_1) \cdot y$. Thus, each
iteration of the binary search can be completed in O(m) time.

The subroutine search($x_1$, $x_2$, $[\alpha, \beta]$) used in DOBA+ solves the problem
$\min_{\theta \in [\alpha, \beta]} f(\theta)$. We will see that restricting the allowed interval to $[\alpha, \beta]$ is useful
in interpolating with fictitious play.
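
A minimal sketch of such a search routine is given below. It is not the exact binary search of the text: it uses plain bisection on the sign of the subgradient, stopping at a tolerance. It also assumes that Y is a probability simplex, so that the best-response oracle reduces to picking the largest entry of $c(\theta)$; for an extensive-form game or an MDP the oracle would be replaced by the appropriate dynamic program. All function names are hypothetical.

```python
def best_response_y(c):
    """Pure best response for the maximizing player: index of the largest entry of c."""
    return max(range(len(c)), key=lambda j: c[j])

def search(c1, c2, lo=0.0, hi=1.0, tol=1e-9):
    """Minimize f(theta) = max_j ((1 - theta) * c1 + theta * c2)[j] over [lo, hi].

    c1 and c2 are the pre-computed cost vectors x1^T M and x2^T M.
    Returns (theta, f(theta)).
    """
    def evaluate(theta):
        c = [(1 - theta) * a + theta * b for a, b in zip(c1, c2)]
        j = best_response_y(c)
        return c[j], c2[j] - c1[j]          # value and a subgradient of f at theta

    f_lo, g_lo = evaluate(lo)
    if g_lo >= 0:                            # f is non-decreasing to the right of lo
        return lo, f_lo
    f_hi, g_hi = evaluate(hi)
    if g_hi <= 0:                            # f is non-increasing to the left of hi
        return hi, f_hi
    best_theta, best_f = (lo, f_lo) if f_lo <= f_hi else (hi, f_hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid, g_mid = evaluate(mid)
        if f_mid < best_f:
            best_theta, best_f = mid, f_mid
        if g_mid > 0:
            hi = mid                         # minimum lies to the left
        elif g_mid < 0:
            lo = mid                         # minimum lies to the right
        else:
            return mid, f_mid                # zero subgradient: global minimum
    return best_theta, best_f

# Hypothetical 2 x 2 game with x1 = (1, 0) and x2 = (0, 1), so c1 and c2 are rows of M.
c1, c2 = [0.0, 2.0], [2.0, 0.0]
print(search(c1, c2))                # unrestricted interval [0, 1]
print(search(c1, c2, 0.0, 0.25))     # restricted interval, as used for interpolation
```

Each evaluation touches only the vectors $c_1$ and $c_2$ and never multiplies by M, which is the point of the pre-computation described above.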


5.3.4  Convergence  Guarantees  and  Fictitious  Play 

Fictitious play (or a no-regret algorithm in self-play) maintains centers $x^{cntr}$ and $y^{cntr}$,
estimates of the minimax optimal implicit mixed strategies. On each iteration FP updates
these centers in the search direction, $x^{srch} = BR_x(y^{cntr})$ and $y^{srch} = BR_y(x^{cntr})$. DOBA+
has a similar structure: it maintains a center for each player, and on each iteration updates
these centers towards a search direction. The algorithm maintains a parameter $\phi$ (the
fictitious play fraction), so that when $\phi = 0$ the algorithm runs in an unrestricted fashion,
while if $\phi = 1$, the algorithm behaves identically to fictitious play.

The selection of the search direction and update of the center occurs in the
updateCenter(x) method of Figure (5.5); updateCenter(y) is identical, but with the roles of x
and y switched. The best response to the opponent’s current center is one possible search
direction; the solution to the model game $\tilde M$ provides another. DOBA+ does a line search
between these two possibilities in order to choose its search direction; however, at least
$\phi$ weight is required to be on the best response to the opponent’s center, so that when
$\phi = 1$ DOBA+ uses the same search direction as FP. This is accomplished via the call to
search($BR_x(y^{cntr})$, $x^{mix}$, $[0, 1 - \phi]$).

Similarly, we update the center by a line search from $x^{cntr}$ towards $x^{srch}$, but we
constrain the interval of the search to linearly interpolate from [0, 1] when $\phi = 0$, to
$[1/(t + 1), 1/(t + 1)]$ when $\phi = 1$. The constants $\alpha$ and $\beta$ in the call to
search($x^{cntr}$, $x^{srch}$, $[\alpha, \beta]$) accomplish this interpolation; when $\phi = 1$, we have the
fixed step-size of $1/(t + 1)$ of fictitious play.
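
The two search intervals follow directly from the formulas in Figure (5.5). The short sketch below (function names are hypothetical) computes them and shows the two extremes of the interpolation.

```python
def search_direction_interval(phi):
    """Interval for the weight on x_mix; at least phi stays on BR_x(y_cntr)."""
    return 0.0, 1.0 - phi

def center_update_interval(phi, t):
    """Interval [alpha, beta] for the step from x_cntr towards x_srch at iteration t."""
    fpstep = 1.0 / (t + 1)
    alpha = phi * fpstep
    beta = fpstep + (1.0 - phi) * (1.0 - fpstep)
    return alpha, beta

for phi in (0.0, 0.5, 1.0):
    print(phi, search_direction_interval(phi), center_update_interval(phi, t=3))
```

With $\phi = 0$ the step is unconstrained in [0, 1]; with $\phi = 1$ the interval collapses to the single fictitious play step $1/(t + 1)$.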

We can ensure convergence of DOBA+ by updating $\phi$ based on the rate of change
of (ub $-$ lb) so that if the rate drops lower than that expected of FP, $\phi$ eventually goes
to 1, and DOBA+ effectively becomes FP. An implementation can be designed so that
once $\phi > 0.99$ (say), it switches fully to FP and avoids the overhead of solving $\tilde M$ and
performing the line searches.

* In fact, for the $x_1$ and $x_2$ used in DOBA+, we will need to compute $c_1$ and $c_2$ anyway at some point,
and so this computation can be made effectively free.

† In many cases the oracle also provides the value $V(x, BR_y(x^T M))$ directly, in which case this
multiplication is unnecessary.

Figure 5.6: An EFG where best responses can be bad.

Note that DOBA+ is asynchronous: the update for x is done first, and then the update
for y takes into account x’s updated $x^{cntr}$. This is commonly done in FP implementations as
well. We ran experiments with several simple methods for updating $\phi$; generally these had
little impact on the runtime of the algorithm. To avoid conflating the impact of the $\phi$
updating scheme with the performance of our unconstrained approach, we present experimental
results with $\phi$ fixed at 0.


5.4  Good  and  Bad  Best  Responses  for  Extensive-form 
Games 

To paraphrase George Orwell, “All best responses are equal, but some best responses are
more equal than others.” We now investigate this notion.

Bad best responses  Consider the EFG of Figure (5.6). Suppose player x is the active
player at state $s_1$; she can either select a constant payoff of 1 by choosing action $a_0$, or
choose one of the other actions $a_i$, each of which leads to a different subgame $G_i$. If
x fixes the policy x that picks $a_0$ with probability 1, then any policy for player y
is a best response to x. Clearly, an arbitrary strategy might do very poorly if player x
happens to play some action other than $a_0$. In general, if a behavior strategy x rules out
any information sets for y, then a best-response behavior policy for y can specify arbitrary
actions at those ruled out information sets.

This could cause problems for the double oracle algorithm. Given $x^{mix}$ on some
iteration of the algorithm, we will compute $y = BR_y((x^{mix})^T M)$ and add y to $B_y$. Future
strategies for y will then be constructed as distributions over strategies in the bundle.
Suppose $x^{mix}$ happens to only ever play action $a_1$ from $s_1$ (in Figure (5.6)). Then, the response
y will likely be “good” for the subgame $G_1$, but terrible at all the other subgames $G_i$. This
means y will be of limited use in constructing a good mixed strategy, assuming x starts
playing actions other than $a_1$ from $s_1$ (for example, a later strategy for x might play many
different actions at $s_1$).

In general, it is possible that $x^{mix}$, which is produced by solving the matrix game $\tilde M$,
might be a mixture of relatively few deterministic best responses, and so it might rule out
many information sets for y. We call such problematic policies “sparse,” because they put
zero probability on some actions, hence making large parts of the game tree unreachable.
This, in turn, will produce cost vectors $c = x^T M$ that are sparse in the usual sense of
having many zero entries. We refer to policies that play all actions with some (possibly
small) positive probability as dense strategies.

We would like y’s best response to “fall back” on some reasonable behavior when
selecting a best-response behavior at an unreachable information set. We can achieve
this in the following manner: suppose $x^{cntr}$ is a dense strategy for x. Then, instead of
computing $BR_y(x)$ directly, we compute $y = BR_y((1 - \epsilon)x + \epsilon x^{cntr})$ for some small
constant $\epsilon$. The strategy y will still be an arbitrarily good response to x (as $\epsilon$ goes to zero),
but at any information set that isn’t reached by x, it will play some best-response action to
$x^{cntr}$, since $x^{cntr}$ doesn’t rule out any information sets.
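
The effect of this smoothing can be seen in a tiny normal-form stand-in for the game of Figure (5.6). The payoff matrix below is hypothetical: row $a_0$ gives the constant payoff 1, while row $a_1$ leads to a situation where y’s choice matters. Against the sparse strategy that always plays $a_0$, both of y’s actions tie; mixing in a small amount of a dense center breaks the tie in favor of the action that is good against the center.

```python
def best_response_y(x, M):
    """Pure best response of the maximizing player y to x (ties go to the lowest index)."""
    payoffs = [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]
    return max(range(len(payoffs)), key=lambda j: payoffs[j]), payoffs

def smoothed(x, x_cntr, eps):
    """The mixture (1 - eps) * x + eps * x_cntr."""
    return [(1 - eps) * a + eps * b for a, b in zip(x, x_cntr)]

M = [[1, 1],    # x plays a_0: constant payoff 1, y's action is irrelevant
     [0, 3]]    # x plays a_1: y's action matters
x = [1.0, 0.0]          # sparse: never plays a_1
x_cntr = [0.5, 0.5]     # dense center (hypothetical)

j_plain, payoffs_plain = best_response_y(x, M)
j_smooth, _ = best_response_y(smoothed(x, x_cntr, eps=1e-4), M)
j_center, _ = best_response_y(x_cntr, M)
print(j_plain, j_smooth, j_center)
```

The unsmoothed oracle returns an arbitrary action (here the first), while the smoothed oracle agrees with the best response to the center.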

We address these issues in DOBA+ in the following way:

•  Our implementation of the best-response oracle for EFGs picks the uniform distribution
over all best-response actions at each information set, rather than just picking
one. This ensures that the strategies we add to our bundle are as dense as possible,
and so hopefully our mixed strategies will not rule out too many information
sets. It is also possible to use “soft” best responses that put positive probability on
all behaviors. Investigating such approximate best-response oracles is an important
avenue for future work.

•  When we solve the matrix game $\tilde M$, we add an additional constraint that requires the
mixtures p and q to put some small weight (we use $\epsilon = 0.0001$) on the current best
strategy for each player (that is, on the $x^{cntr}$ and $y^{cntr}$ corresponding to the current
ub and lb). These centers tend to be dense, as they are mixtures of many strategies,
and so this ensures that when we compute a best response to $x^{mix}$, the best response
will at worst fall back to playing a best response to $x^{cntr}$.
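
The first item above amounts to a different tie-breaking rule at each information set. A minimal sketch, assuming the oracle has already computed a value for each action at the information set (the function name and the tolerance are hypothetical):

```python
def uniform_best_response(action_values, tol=1e-12):
    """Uniform distribution over all actions whose value is within tol of the best."""
    best = max(action_values)
    is_best = [v >= best - tol for v in action_values]
    count = sum(is_best)
    return [1.0 / count if flag else 0.0 for flag in is_best]

print(uniform_best_response([1.0, 3.0, 3.0]))   # two tied best actions
print(uniform_best_response([0.0, 0.0, 0.0]))   # unreachable set: every action ties
```

At an information set that the opponent’s strategy rules out, every action ties, so this rule yields a fully mixed behavior there instead of an arbitrary pure action.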

We have seen that some best-response strategies are worse than others; we now
consider the other side of the spectrum, considering superior best responses. DOBA+ does
not currently use the concepts in the next section, but we hope to exploit these ideas
algorithmically in future work.

Optimal best responses  The problem of solving the convex game (X, Y, M) can be
written as

    $\min_{x \in X} V(x)$

where V(x) is the convex function

    $V(x) = \max_{y \in Y} x^T M y.$

The best-response problem for a fixed $x \in X$ is simply to compute V(x) (and a y that
achieves this value). But, as we saw in the previous section, there may be many possible
best responses. Any convex combination of best responses is also a best response, and so
the set of all best responses to x is convex: call this set $Y_{br}(x)$, that is,

    $Y_{br}(x) = \{y \in Y \mid y \text{ is a best response to } x\}.$

We define the optimal best response as the solution to the game where y is restricted to
play from the set $Y_{br}(x)$, but x can play arbitrarily (in particular, she is under no
obligation to actually play x). This is exactly the restriction of the original convex game to
$(X, Y_{br}(x), M)$, which has value

    $\max_{y \in Y_{br}(x)} \min_{x' \in X} x'^T M y.$

Let $BR^*_y$ compute an optimal best response. Computing an optimal best response is in
general as hard as solving an arbitrary convex game. For example, consider the game of
Figure (5.6). Computing the optimal best response for player y to the policy for x that
always plays $a_0$ requires finding a minimax optimal policy for $G_1$, $G_2$, and all the other
possible subgames. However, in some cases computing an optimal best response can be
much easier than solving the overall game.

The set $Y_{br}(x)$ for an extensive-form game has a straightforward interpretation: it
corresponds to the set of behavior strategies that only put positive probability on actions
that are local best responses to x (or, more precisely, actions that are local best responses
to the value function on player y’s information set tree induced by the strategy x). Thus,
we can efficiently optimize any linear function over the set $Y_{br}(x)$ by using the standard
linear-time dynamic programming algorithm for computing a best response for an EFG;
we simply run the algorithm on the restricted information set tree where we have discarded
all actions that are not local best responses.

Optimal best-response strategies have several nice properties. It is easy to verify that
an optimal best response $y^* = BR^*_y(x^*)$ to a minimax strategy $x^*$ is a minimax optimal
strategy for y. It can also be shown that for an arbitrary x, computing an optimal best
response to x corresponds to finding a direction of steepest feasible descent with respect to
V(x) from x: we can think of $y' = BR^*_y(x)$ as defining the best (that is, minimal slope)
mixture of subgradients at x.


5.5  Experimental  Results 

In this section we test the algorithms introduced on both the sensor-placement /
observation-avoidance adversarial MDP problem, and on extensive-form game representations of
Rhode Island Hold’em poker and approximated Texas Hold’em poker.


5.5.1  Adversarial-cost  MDPs 

We consider the example sensor placement / avoidance game described in Section 3.4.2.
We model the robot’s path planning problem by discretizing a given map at a resolution
of between 10 and 50 cm per cell, producing grids of size 269 x 226 to 54 x 45. We do
not model the robot’s orientation and rely on lower level navigation software to move the
robot along planned trajectories. Each cell corresponds to a state in the MDP.

         grid size    k     LP          basic DOBA    iter
    A    54 x 45      32    56.8 s      1.9 s         15
    B    54 x 45      328   104.2 s     8.4 s         47
    C    94 x 79      136   2835.4 s    10.5 s        30
    D    135 x 113    32    1266.0 s    10.2 s        14
    E    135 x 113    92    8713.0 s    18.3 s        30
    F    269 x 226    16    -           39.8 s        17
    G    269 x 226    32    -           41.1 s        15

Table 5.1: Sample problem discretizations, number of sensor placements available to
the opponent (k), solution time solving Equation (5.4) with CPLEX (LP), and solution time
and number of iterations using the basic Double Oracle Algorithm.


The transition model we use gives the robot 16 actions, corresponding to movement in
any of 16 compass directions. Movement in the directions N, S, E, and W corresponds to
moving to an adjacent cell, NE, SE, SW, and NW correspond to moving to an adjacent cell
diagonally, and the other eight directions (NNE, etc.) correspond to moving two cells in
one direction, and one cell in an orthogonal direction. Allowing movement in 16 directions
means distances in the discretized world approximate a Euclidean distance metric, rather
than the Manhattan (L1) metric implied by only allowing movement in the 4 cardinal
directions. Each cell s has a cost weight m(s) for movement through that cell; in our
experiments all of these are set to 1.0 for simplicity. The actual movement costs for each
action are calculated by considering the distance traveled (either 1, $\sqrt{2}$, or $\sqrt{5}$) weighted
by the movement costs assigned to each cell. For example, movement in one of the four
cardinal directions from a state u to a state v incurs cost 0.5 m(u) + 0.5 m(v).
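
The 16-direction action set and its movement costs can be sketched as follows. The text only gives the cost of a cardinal move explicitly; weighting the other moves by the mean of the two end cells’ weights is an assumption made here for illustration (the original may account for the intermediate cells crossed), and the names `MOVES` and `move_cost` are hypothetical.

```python
from math import gcd, hypot

# All offsets within two cells whose components are coprime: 4 cardinal moves,
# 4 diagonal moves, and 8 "knight-like" moves such as NNE = (1, 2).
MOVES = [(dx, dy) for dx in range(-2, 3) for dy in range(-2, 3)
         if gcd(abs(dx), abs(dy)) == 1]

def move_cost(m, u, move):
    """Distance traveled times the mean cost weight of the start and end cells.

    m maps a cell (col, row) to its cost weight m(s). For cardinal moves this
    reduces to 0.5 * m(u) + 0.5 * m(v), as in the text; for other moves the
    averaging is an illustrative assumption.
    """
    v = (u[0] + move[0], u[1] + move[1])
    return hypot(move[0], move[1]) * (0.5 * m[u] + 0.5 * m[v])

# Uniform weights of 1.0 on a small 5 x 5 grid, as in the experiments.
weights = {(c, r): 1.0 for c in range(5) for r in range(5)}
distances = sorted({round(hypot(dx, dy), 4) for dx, dy in MOVES})
print(len(MOVES), distances)
print(move_cost(weights, (2, 2), (1, 0)), move_cost(weights, (2, 2), (1, 2)))
```

With uniform weights the cost of each move equals its Euclidean length, which is why paths in this model approximate straight-line distances.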

Cells  observed  by  a  sensor  have  an  additional  cost  given  by  a  linear  function  of  the 
distance  to  the  sensor.  An  additional  cost  of  20  is  incurred  if  observed  by  an  adjacent 
sensor,  and  cost  10  is  incurred  if  the  sensor  is  at  the  maximum  distance.  The  ratio  of 
movement  cost  to  observation  cost  determines  the  planner’s  relative  preference  for  paths 
with  low  expected  observation  times  versus  short  paths.  We  assume  a  fixed  start  location 
for  our  robot  in  all  problems,  so  pure  strategies  can  be  represented  as  paths. 
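
A minimal sketch of the observation cost follows, assuming that “adjacent” means a distance of one cell, that the sensor’s maximum range is a model parameter (`max_range`, hypothetical), and that cells beyond the range are not observed at all. Visibility (line of sight) is ignored here.

```python
def observation_cost(distance, max_range, adjacent=1.0, near_cost=20.0, far_cost=10.0):
    """Added cost for a cell at the given distance from a sensor that observes it.

    Linear in the distance: near_cost at an adjacent cell, far_cost at max_range.
    """
    if distance > max_range:
        return 0.0                      # out of range: not observed (assumption)
    if max_range <= adjacent:
        return near_cost
    frac = (max(distance, adjacent) - adjacent) / (max_range - adjacent)
    return near_cost + frac * (far_cost - near_cost)

for d in (1, 5.5, 10, 11):
    print(d, observation_cost(d, max_range=10))
```

Since a single step of movement costs about 1.0, even the smallest observation cost of 10 corresponds to a detour of roughly ten cells.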

We  present  experiments  using  both  the  single  oracle  algorithm  and  the  basic  double 
oracle  algorithm  on  this  domain.  The  basic  DOBA  was  sufficient  because  at  most  30 
iterations  were  needed  to  solve  our  test  problems.  Both  algorithms  use  Dijkstra’s  algorithm 
for  the  planning  player’s  oracle  BRx. 

The column (cost player) oracle for the double oracle algorithm is the following naive
one: for an arbitrary matrix game M, an oracle may be created (say $BR_y$ for the column
player) by finding the minimum entry in the vector $p^T M$ when the row player plays mixed
strategy p. Perhaps surprisingly, we show that even using such a naive oracle can yield
performance improvements over the single oracle algorithm. If sensor fields of view have
limited overlap, then a fast best-response oracle for multiple sensor placement can also be
constructed by considering each sensor independently and using the result to bound the
cost vector for a pair of sensors.

Our implementation is in Java, with an external call to CPLEX 7.1 [ILOG, Inc., 2003]
for solving all linear programs. For comparison, we also used CPLEX to solve the linear
program (5.4) directly (without any decompositions).
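
The naive column oracle described above can be sketched in a few lines. Whether the best entry for the column player is the smallest or the largest depends on the sign convention for payoffs, so the sketch takes this as a flag; the function name and the example matrix are hypothetical, and the sketch is in Python rather than the Java of the actual implementation.

```python
def naive_column_oracle(M, p, maximize=True):
    """Pure best response of the column player to the row mixture p.

    Computes the vector p^T M and scans it for the best entry. Returns the
    column index and the corresponding payoff.
    """
    payoffs = [sum(p[i] * M[i][j] for i in range(len(p))) for j in range(len(M[0]))]
    pick = max if maximize else min
    j = pick(range(len(payoffs)), key=lambda k: payoffs[k])
    return j, payoffs[j]

M = [[1, 4, 2],
     [3, 0, 2]]
p = [0.25, 0.75]
print(naive_column_oracle(M, p, maximize=True))
print(naive_column_oracle(M, p, maximize=False))
```

Each call costs time proportional to the number of entries of M, which is acceptable only when every column (cost vector) can be enumerated explicitly.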

All results given in this paper correspond to the map in Figure (3.4) (found in
Chapter 3). We performed experiments on other maps as well, but we do not report them
because the results were qualitatively very similar. We solved the problem using various
discretizations and different numbers of potential cost vectors to demonstrate the scaling
properties of our algorithms. These problem discretizations are shown in Table (5.1),
along with their double oracle and direct LP solution times. The letters in the table
correspond to those in Figure (5.7), which compares the double and single oracle algorithms.
All times are wall-clock times on a 1 GHz Pentium III machine with 512 MB main memory.
Results reported are the average over 5 runs. Standard deviations were insignificant, less
than 1/10th of total solve time in all cases.

Our results indicate that both the double and single oracle algorithms significantly
outperform directly solving the linear program. This improvement in performance is possible
because our algorithms take advantage of the fact that the linear program (5.4) is “almost”
an MDP, and the planner’s row oracle is implemented with Dijkstra’s algorithm, which
is much faster than general LP solvers. The particularly lopsided times for problems C,
D, and E may have been partially caused by CPLEX running low on physical memory;
we didn’t try solving the LPs for problems F and G because they are even larger. One of
the benefits of our decomposition algorithms is their lower memory usage, but even when
memory was not an issue our algorithms were significantly faster than CPLEX.

As  Figure  (5.7)  shows,  the  basic  double  oracle  algorithm  outperforms  the  single  oracle 
version  for  all  problems.  The  difference  is  most  pronounced  on  problems  with  a  large 
number of cost vectors. The times for solving the master LPs and for the column oracle
are insignificant, so the performance gained by the double oracle algorithm is explained
by  its  implicit  preference  for  mixed  strategies  with  small  support,  and  the  correspondingly 
smaller  M. 

We ran the oracle algorithms with $\epsilon = 0.005$, which is effectively optimal considering
that a single step of movement has cost 1.0, and the minimum cost for being observed is
10.0. Thus, assuming the model expressed by the linear program is accurate, our algorithms
produce the best result possible. In practice, it may be that the costs need to be
adjusted to obtain the desired result; for example, in our path-planning problem there is a
trade-off between shortest paths and being observed. By adjusting the balance between the
opponent-controlled and fixed movement costs of the problem, the algorithm can be made
to weigh this trade-off differently. Finding the proper balance for a particular problem may
require some tweaking of model parameters.


Figure 5.7: Double and single oracle algorithm performance on problems shown in Table (5.5.1).
We also ran some limited experiments in the online setting, using the no-regret algorithm of Kalai and Vempala [2003] to play the repeated game as discussed in Section 3.1.2. Figure (5.8) shows 100 trajectories against an opponent who plays a minimax optimal sensor placement: that is, for each round the adversary samples a sensor placement c ∈ K according to a minimax optimal distribution q. The Kalai-Vempala algorithm chooses a best response to the average cost vector played by the adversary so far, perturbed with a small amount of randomness. The implementation is straightforward, and the generated paths are reasonable. The random perturbations to the cost vectors introduced by the Kalai-Vempala algorithm tend to produce paths that are somewhat choppy, but these paths can be smoothed by lower-level control routines. As expected, the algorithm converges to a best response to the minimax optimal q played by the adversary.

For these MDP-based problems, there is the possibility for a very large speedup through the use of more sophisticated best-response oracles for the planning player: in our experiments, the planner’s oracle calls typically take 90% or more of the total runtime. We used Dijkstra’s algorithm to solve our deterministic best-response problem. However, it is straightforward to construct reasonable heuristics for this problem, for example by using the L1 distance for the movement costs plus observation costs. Thus, we could apply the A*-search algorithm. Further, there is likely to be similarity between the optimal solutions from one round to the next. To take advantage of this, we could use an incremental A* implementation, for example that of Koenig and Likhachev [2001].
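As a rough illustration of the A* idea mentioned above, the following sketch runs A* on a 4-connected grid. It is not the planner used in our experiments: the grid model, the function name astar_grid, and the parameter step_cost are assumptions made for this example, and the heuristic is simply the L1 distance to the goal multiplied by the per-step movement cost, which never overestimates because every remaining move costs at least one step.

```python
import heapq

def astar_grid(cost, start, goal, step_cost=1.0):
    """A* on a 4-connected grid (illustrative model, not the thesis planner).

    cost[r][c] >= 0 is the extra (e.g. observation) cost paid on entering
    cell (r, c); every move also pays step_cost.
    Returns (total_cost, path), or (inf, []) if the goal is unreachable.
    """
    rows, cols = len(cost), len(cost[0])

    def h(cell):
        # L1 distance times the per-step cost: a lower bound on remaining cost.
        return step_cost * (abs(cell[0] - goal[0]) + abs(cell[1] - goal[1]))

    best = {start: 0.0}      # cheapest known cost-to-reach for each cell
    parent = {}              # back-pointers for path reconstruction
    frontier = [(h(start), 0.0, start)]
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if g > best.get(cell, float("inf")):
            continue         # stale queue entry
        if cell == goal:
            path = [cell]
            while cell in parent:
                cell = parent[cell]
                path.append(cell)
            return g, path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                ng = g + step_cost + cost[nr][nc]
                if ng < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = ng
                    parent[(nr, nc)] = cell
                    heapq.heappush(frontier, (ng + h((nr, nc)), ng, (nr, nc)))
    return float("inf"), []

# A heavily observed centre cell: the planner detours around it.
grid = [[0, 0, 0],
        [0, 10, 0],
        [0, 0, 0]]
total, path = astar_grid(grid, (1, 0), (1, 2))
```

A tighter heuristic that also accounts for unavoidable observation costs would prune more of the search, at the price of more work per node.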
Figure 5.8: Convergence to optimal response in the online setting. The thin lines indicate
100  trajectories  produced  by  the  Kalai-Vempala  algorithm  against  a  fixed  minimax  solu¬ 
tion  of  the  opponent.  The  wider  lines  indicate  a  minimax  solution  for  the  planner.  The 
erratic  nature  of  the  Kalai-Vempala  paths  is  caused  by  the  randomness  in  the  cost  vectors 
introduced  by  that  algorithm:  the  small  jogs  in  the  path  are  caused  by  the  robot  driving 
around  small  areas  where  a  higher  cost  has  been  hallucinated. 


5.5.2  Extensive-form  Game  Experiments 

We tested the DOBA+ algorithm and fictitious play on abstracted Rhode Island Hold’em.
We  chose  this  as  a  representative  problem  both  because  Rhode  Island  Hold’em  is  a  well- 
known  AI  testbed  and  because  the  abstracted  version  is  one  of  the  largest  extensive-form 
games  that  can  be  solved  in  a  reasonable  amount  of  time  using  CPLEX’s  barrier  method 
implementation  on  a  modern  workstation;  the  game  was  first  solved  by  Gilpin  and  Sand- 
holm  [2005].  CPLEX’s  performance  provides  one  benchmark  against  which  to  evaluate 
our  results. 

Rhode Island Hold’em (RIH) was introduced by Shi and Littman [2001] as a challenge problem for AI research. The game is similar to two-player limit Texas Hold’em. It is played with a full deck of 52 cards, but each player receives only a single face-down hole card, and there are only two community cards. There are three rounds of betting, with up to three raises per betting round. Unabstracted Rhode Island Hold’em has a game tree with 3.1 billion nodes, which is still too large to work with conveniently. Instead, Andrew Gilpin was kind enough to provide us with the convex game representation produced by the GameShrink algorithm [Gilpin and Sandholm, 2005]. Sparsely represented, this game has approximately 50 × 10^6 non-zeros in the payoff and sequence constraint matrices, with dimensions m = n = 883,741, taking almost 600MB of memory to store. A solution to this game can be converted to a payoff-equivalent strategy for the unabstracted game. The poker game has $5.00 antes and a maximum pot size of $310.00. The uniform random strategy, from which we started both our algorithm and fictitious play, loses approximately $290.00 per game. The minimax value of the game is −$0.64; the value is negative because player x (the minimizing player) bets second, and thus gains a small advantage based on the information revealed by the first player’s initial bet.

Figure 5.9: Algorithm runtime versus approximation error on Rhode Island Hold’em. The y axis is a log scale plot of ε for the best approximate solution the algorithms can return at a given time, in units of $0.01.

Figure 5.10: Algorithm runtime versus approximation error on approximately abstracted Texas Hold’em. Note that with both bundle sizes, DOBA+ significantly outperforms both versions of fictitious play.

The CPLEX commercial linear programming package solved abstracted Rhode Island Hold’em via the barrier method in about 7.5 days, using 25 GB of memory; achieving an ε = $0.20 approximate minimax solution took 110.3 hours, or over 4.5 days. DOBA+ produced a solution of that quality in 130 minutes; see Figure (5.11).

Figure (5.9) compares the anytime performance of DOBA+ and fictitious play (FP). Both algorithms were initialized using the uniform-random behavior strategy for both players; that is, at each information set the agent selects an action uniformly at random. Especially early on, DOBA+ can produce higher quality solutions for a given amount of time. For example, it takes DOBA+ 11 minutes to bound the value of the game in [$0.00, −$1.35], thereby proving that player x has an advantage. It takes synchronous FP (as in Figure (5.1)) about 20 minutes to get comparable bounds, but asynchronous FP¹ takes only 10 minutes to get these bounds. Experiments on smaller approximately abstracted versions of Rhode Island Hold’em, however, show DOBA+ can outperform FP by an order of magnitude; on the smallest problem we tested (with m = n = 10^), CPLEX outperformed both algorithms.

Figure 5.11: Comparison of runtimes for CPLEX and the Double Oracle Bundle Algorithm (DOBA+) to produce an ε = $0.20 and ε = $0.73 solution to abstracted RIH. Note that the runtimes are on a log scale.

Figure (5.10) shows results for experiments on an approximately abstracted instance of Texas Hold’em, similar to those discussed in [Gilpin and Sandholm, 2006a]. This instance only models the first three rounds of betting. This problem has significantly more non-zeros (130 × 10^6) than the Rhode Island Hold’em instance, and requires about 1.5GB of memory to represent. However, this problem has lower dimensionality, with m = n = 236,416. The instance we tested has a small blind (similar to an ante) of $0.50, and a big blind of $1.00. For this problem, DOBA+ significantly outperformed asynchronous fictitious play: FP bounded the value of the game in [−0.028, −0.046] in a 2 hour run, while DOBA+ achieved better bounds in less than 6 minutes. We used DOBA+ with φ = 1 as our asynchronous FP implementation, and this incurred extra overhead. A direct implementation might allow a 2-3x performance increase, but clearly this would have made little difference for the Texas Hold’em instance.

¹ Asynchronous FP does the updates for one player, say x, before the updates for the other player on each iteration. That is, asynchronous FP is the algorithm of Figure (5.1) where all the statements in the left column in the while loop are executed before the statements in the right column. Note that DOBA+ with φ fixed at 1.0 effectively executes asynchronous FP. For many problems, solving the small matrix game takes a negligible amount of time, and so making DOBA+ fully asynchronous by re-solving the matrix game between each call to updateCenter may be advantageous [Krause, 2006].

We  conclude  that  none  of  these  algorithms  are  a  clear  winner  even  when  only  consid¬ 
ering  extensive-form  games,  and  so  the  general  problem  of  designing  fast  algorithms  for 
all  types  of  convex  games  is  still  very  much  open. 
Chapter 6

Online  Geometric  Optimization  in  the 
Bandit  Setting 


In this chapter we describe a new algorithm for solving online linear programming problems in the bandit setting when facing an adaptive adversary; this work originally appeared in [McMahan and Blum, 2004]. Though it has many potential applications, the algorithm we develop is particularly applicable to the case of a convex game (X, Y, M) that is played repeatedly by player x, in the manner discussed in Section 3.1.2. The approach here is novel in that it does not require observation of the opponent’s strategy y on each iteration, or even the cost vector My. Rather, we can run the algorithm of this chapter for player x as long as the total payoff $x^T M y$ is observed on each round.


6.1  Introduction  and  Background 

Kalai and Vempala [2003] give an elegant, efficient algorithm for a broad class of online optimization problems. In their setting, we have an arbitrary (bounded) set $S \subset \mathbb{R}^n$ of feasible points. At each time step t, an online algorithm A must select a point $x^t \in S$ and simultaneously an adversary selects a cost vector $c^t \in \mathbb{R}^n$ (throughout the chapter we use superscripts to index iterations). The algorithm then observes $c^t$ and incurs cost $c^t \cdot x^t$. Kalai and Vempala show that so long as we have an efficient algorithm for the offline problem (given $c \in \mathbb{R}^n$ find $x \in S$ to minimize $c \cdot x$) and so long as the cost vectors are bounded, we can efficiently solve the online problem of performing nearly as well as the best fixed $x \in S$ in hindsight. This generalizes the classic “expert advice” problem, because we do not require the set S to be represented explicitly: we just need an efficient oracle for selecting the best $x \in S$ in hindsight. Further, it decouples the number of experts from the underlying dimensionality n of the decision set, under the assumption the cost of a decision is a linear function of n features of the decision. The standard experts setting can be recovered by letting $S = \{e_1, \ldots, e_n\}$, the columns of the n × n identity matrix.
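The reduction to the experts setting can be made concrete with a small sketch. Here the decision set is stored explicitly as a finite list, which is an assumption made only for illustration (the point of the framework is that S need not be enumerated), and the names oracle and cumulative_cost are ours.

```python
def oracle(S, c):
    """Return the point of the finite set S minimizing the linear cost c . x."""
    return min(S, key=lambda x: sum(ci * xi for ci, xi in zip(c, x)))

n = 4
# The experts setting: S is the set of columns of the n x n identity matrix.
experts = [tuple(1.0 if i == j else 0.0 for i in range(n)) for j in range(n)]

# With S = {e_1, ..., e_n}, c . e_j is just the j-th coordinate of c, so the
# oracle picks the expert with the smallest cumulative cost.
cumulative_cost = [3.0, 1.5, 2.0, 4.0]
best = oracle(experts, cumulative_cost)
best_expert_index = best.index(1.0)
```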

A  problem  that  fits  naturally  into  this  framework  is  an  online  shortest  path  problem 
where  we  repeatedly  travel  between  two  points  a  and  b  in  some  graph  whose  edge  costs 
change  each  day  (say,  due  to  traffic).  In  this  case,  we  can  view  the  set  of  paths  as  a  set 
S of points in a space of dimension equal to the number of edges in the graph, and $c^t$ is
simply  the  vector  of  edge  costs  on  day  t.  Even  though  the  number  of  paths  in  a  graph 
can  be  exponential  in  the  number  of  edges  (i.e.,  the  set  S  is  of  exponential  size),  since 
we  can  solve  the  shortest  path  problem  for  any  given  set  of  edge  lengths,  we  can  apply 
the  Kalai-Vempala  algorithm.  (Note  that  a  different  algorithm  for  the  special  case  of  the 
online  shortest  path  problem  is  given  by  Takimoto  and  Warmuth  [2002].) 
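To make the encoding concrete, the sketch below represents a path as a 0/1 vector indexed by edges, so that the path length on day t is the inner product of that vector with the day's edge costs. The offline oracle is an ordinary Dijkstra search. The function name shortest_path_vector and the toy graph are assumptions for this example, and the code assumes b is reachable from a.

```python
import heapq

def shortest_path_vector(edges, c, a, b):
    """edges: list of directed (u, v) pairs; c: nonnegative costs in the same order.
    Returns the 0/1 incidence vector x of a cheapest a-b path, so that
    sum(c[k] * x[k]) is the length of that path. Assumes b is reachable from a."""
    out = {}
    for k, (u, v) in enumerate(edges):
        out.setdefault(u, []).append((v, k))
    dist, via = {a: 0.0}, {}
    heap = [(0.0, a)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # stale queue entry
        for v, k in out.get(u, []):
            nd = d + c[k]
            if nd < dist.get(v, float("inf")):
                dist[v], via[v] = nd, (u, k)
                heapq.heappush(heap, (nd, v))
    x = [0] * len(edges)
    node = b
    while node != a:                      # walk back-pointers from b to a
        node, k = via[node]
        x[k] = 1
    return x

edges = [("a", "m"), ("m", "b"), ("a", "b")]
day1 = [1.0, 1.0, 3.0]                    # the two-hop route is cheaper
day2 = [2.0, 2.0, 3.0]                    # the direct edge is cheaper
x1 = shortest_path_vector(edges, day1, "a", "b")
x2 = shortest_path_vector(edges, day2, "a", "b")
len1 = sum(ci * xi for ci, xi in zip(day1, x1))
len2 = sum(ci * xi for ci, xi in zip(day2, x2))
```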

A  natural  generalization  of  the  above  problem,  considered  by  Awerbuch  and  Kleinberg 
[2004], is to imagine that rather than being given the entire cost vector $c^t$, the algorithm
is simply told the cost incurred $c^t \cdot x^t$. For example, in the case of shortest paths, rather
than  being  told  the  lengths  of  all  edges  at  time  t,  this  would  correspond  to  just  being 
told  the  total  time  taken  to  reach  the  destination.  Thus,  this  is  the  “bandit  version”  of  the 
Kalai-Vempala  setting.  Awerbuch  and  Kleinberg  present  two  results:  an  algorithm  for  the 
general  problem  in  the  presence  of  an  oblivious  adversary,  and  an  algorithm  for  the  special 
case  of  the  shortest  path  problem  that  works  in  the  presence  of  an  adaptive  adversary.  The 
difference  between  the  two  adversaries  is  that  an  oblivious  adversary  must  commit  to  the 
entire  sequence  of  cost  vectors  in  advance,  whereas  an  adaptive  adversary  may  determine 
the  next  cost  vector  based  on  the  online  algorithm’s  play  (and  hence,  the  information  the 
algorithm  received)  in  the  previous  time  steps.  Thus,  an  adaptive  adversary  is  in  essence 
playing  a  repeated  game.  They  leave  open  the  question  of  achieving  good  regret  guarantees 
for  an  adaptive  adversary  in  the  general  setting. 

In  this  chapter,  we  solve  the  open  question  of  [Awerbuch  and  Kleinberg,  2004],  giving 
an  algorithm  for  the  general  bandit  setting  in  the  presence  of  an  adaptive  adversary.  More¬ 
over,  our  method  is  significantly  simpler  than  the  special-purpose  algorithm  of  Awerbuch 
and Kleinberg for shortest paths. Our bounds are somewhat worse: we achieve regret bounds of $O(T^{3/4}\sqrt{\ln T})$, compared to the $O(T^{2/3})$ bounds of [Awerbuch and Kleinberg, 2004].

The basic idea of our approach is as follows. We begin by noticing that the only history information used by the Kalai-Vempala algorithm in determining its action at time t is the sum $c^{1:t-1} = \sum_{\tau=1}^{t-1} c^\tau$ of all cost vectors received so far (we use this abbreviated notation for sums over iteration indexes throughout the chapter). Furthermore, the way this is used in the algorithm is by adding random noise $\mu$ to this vector, and then calling the offline oracle to find the $x^t \in S$ that minimizes $(c^{1:t-1} + \mu) \cdot x^t$. So, if we can design a bandit algorithm that produces an estimate $\hat{c}^{1:t-1}$ of $c^{1:t-1}$, and show that with high probability even an adaptive adversary will not cause $\hat{c}^{1:t-1}$ to differ too substantially from $c^{1:t-1}$, we can then argue that the distribution $\hat{c}^{1:t-1} + \mu$ is close enough to $c^{1:t-1} + \mu$ for the Kalai-Vempala analysis to apply. In fact, to make our analysis a bit more general, so that we could potentially use other algorithms as subroutines, we will argue a little differently. Let $\mathrm{OPT}(c) = \min_{x \in S}(c \cdot x)$. We will show that with high probability, $\mathrm{OPT}(\hat{c}^{1:T})$ is close to $\mathrm{OPT}(c^{1:T})$ and $\hat{c}^{1:T}$ satisfies conditions needed for the subroutine to achieve low regret on $\hat{c}^{1:T}$. This means that our subroutine, which believes it has seen $\hat{c}^{1:T}$, will achieve performance on $\hat{c}^{1:T}$ close to $\mathrm{OPT}(c^{1:T})$. We then finish off by arguing that our performance on $c^{1:T}$ is close to its performance on $\hat{c}^{1:T}$.
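The perturb-then-optimize step described above can be sketched as follows. This is a minimal illustration, not the implementation analyzed in Appendix C: S is a finite list so that the oracle is a plain minimum, the noise is drawn uniformly from $[0, 1/\epsilon]^n$ (one of the choices analyzed by Kalai and Vempala), and the names fpl_choose, cost_sum and rng are assumptions of this example.

```python
import random

def fpl_choose(S, cost_sum, epsilon, rng):
    """One follow-the-perturbed-leader decision.

    cost_sum plays the role of c^{1:t-1} (or of its estimate in the bandit
    setting); the perturbation mu is uniform on [0, 1/epsilon]^n.
    """
    mu = [rng.uniform(0.0, 1.0 / epsilon) for _ in cost_sum]
    perturbed = [s + m for s, m in zip(cost_sum, mu)]
    # Offline oracle: best decision against the perturbed cumulative costs.
    return min(S, key=lambda x: sum(p * xi for p, xi in zip(perturbed, x)))

rng = random.Random(0)
S = [(1.0, 0.0), (0.0, 1.0)]
# The second coordinate is so costly that noise of size at most 1 cannot
# change the oracle's answer.
choice = fpl_choose(S, [0.0, 100.0], epsilon=1.0, rng=rng)
```

Only the summed cost vector enters the decision, which is why an accurate estimate of that sum is all a bandit version needs.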

The behavior of the bandit algorithm will in fact be fairly simple. We begin by choosing a basis B of (at most) n points in S to use for sampling (we address the issue of how B is chosen when we describe our algorithm in detail). Then, at each time step t, with probability $\gamma$ we explore by playing a random basis element, and otherwise (with probability $1 - \gamma$) we exploit by playing according to the Kalai-Vempala algorithm. For each basis element $b_j$, we use our cost incurred while exploring with that basis element, scaled by $n/\gamma$, as an estimate of $c^{1:t} \cdot b_j$. Using martingale tail inequalities, we argue that even an adaptive adversary cannot make our estimate differ too wildly from the true value of $c^{1:t} \cdot b_j$, and use this to show that after matrix inversion, our estimate $\hat{c}^{1:t}$ is close to its correct value with high probability.


6.2  Problem  Formalization 

We  can  now  fully  formalize  the  problem.  First,  however,  we  establish  a  few  notational 
conventions.  As  mentioned  previously,  we  use  superscripts  to  index  iterations  (or  rounds) 
of  our  algorithm,  and  use  the  abbreviated  summation  notation  when  summing  variables 
over iterations. Vector quantities are indicated in bold, and subscripts index into vectors or sets. Hats (such as $\hat{c}^t$) denote estimates of the corresponding actual quantities. The variables and constants used in this chapter are summarized in Table (6.1).

As mentioned above, we consider the setting of [Kalai and Vempala, 2003] in which we have an arbitrary (bounded) set $S \subset \mathbb{R}^n$ of feasible points. At each time step t, the online algorithm A must select a point $x^t \in S$ and simultaneously an adversary selects a cost vector $c^t \in \mathbb{R}^n$. The algorithm then incurs cost $c^t \cdot x^t$. Unlike [Kalai and Vempala, 2003], however, rather than being told $c^t$, the algorithm simply learns its cost $c^t \cdot x^t$.
For simplicity, throughout this chapter we assume a fixed adaptive adversary V and time horizon T. Since our choice of algorithm parameters depends on T, we assume¹ T is known to the algorithm. We refer to the sequence of decisions made by the algorithm so far as a decision history, which can be written $h^t = [x^1, \ldots, x^t]$. Let $H^*$ be the set of all possible decision histories of length 0 through T − 1. Without loss of generality [see Auer et al., 1995, for example], we assume our adaptive adversary is deterministic, as specified by a function $V : H^* \to \mathbb{R}^n$, a mapping from decision histories to cost vectors. Thus, $V(h^{t-1}) = c^t$ is the cost vector for timestep t.
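The view of an adversary as a function from decision histories to cost vectors can be sketched in a few lines. Both adversaries below are hypothetical examples constructed for illustration; an oblivious adversary is the special case whose output depends only on the length of the history.

```python
def oblivious(costs):
    """Adversary that committed to its whole cost sequence in advance:
    it looks only at how many rounds have been played, not at what we played."""
    return lambda history: costs[len(history)]

def adaptive_penalize_last(n):
    """Hypothetical adaptive adversary: charges 1 on every coordinate that
    was nonzero in our most recent decision."""
    def V(history):
        if not history:
            return [0.0] * n
        return [1.0 if xi != 0 else 0.0 for xi in history[-1]]
    return V

V_obl = oblivious([[1.0, 0.0], [0.0, 1.0]])
V_ad = adaptive_penalize_last(2)

# The oblivious adversary gives the same second-round cost whatever we played;
# the adaptive one reacts to our first decision.
same = V_obl([(1, 0)]) == V_obl([(0, 1)])
reacts = V_ad([(1, 0)]) != V_ad([(0, 1)])
```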

We can view our online decision problem as a game, where on each iteration t the adversary V selects a new cost vector $c^t$ based on $h^{t-1}$, and the online algorithm A selects a decision $x^t \in S$ based on its past plays and observations, and possibly additional hidden state or randomness. Then, A pays $c^t \cdot x^t$ and observes this cost. For our analysis, we assume an $L_1$ bound on S, namely $\|x\|_1 \le D/2$ for all $x \in S$, so $\|x - y\|_1 \le D$ for all $x, y \in S$. We also assume that $|c \cdot x| \le M$ for all $x \in S$ and all c played by V. We also assume S is full rank: if it is not we simply project to a lower-dimensional representation. Some of these assumptions can be lifted or modified, but this set of assumptions simplifies the analysis.

For a fixed decision history $h^T$ and cost history $k^T = (c^1, \ldots, c^T)$, we define $\mathrm{loss}(h^T, k^T) = \sum_{t=1}^{T} (c^t \cdot x^t)$. For a randomized algorithm A and adversary V, we define the random variable loss(A, V) to be $\mathrm{loss}(h^T, k^T)$, where $h^T$ is drawn from the distribution over histories defined by A and V, and $k^T = (V(h^0), \ldots, V(h^{T-1}))$. When it is clear from context, we will omit the dependence on V, writing only loss(A).

Our goal is to define an online algorithm with low regret. That is, we want a guarantee that the total loss incurred will, in expectation, not be much larger than the optimal strategy in hindsight against the cost sequence we actually faced. To formalize this, first define an oracle $R : \mathbb{R}^n \to S$ that solves the offline optimization problem, $R(c) = \arg\min_{x \in S}(c \cdot x)$. We then define $\mathrm{OPT}(k^T) = c^{1:T} \cdot R(c^{1:T})$. Similarly, OPT(A, V) is the random variable $\mathrm{OPT}(k^T)$ when $k^T$ is generated by playing V against A. We again drop the dependence on V and A when it is clear from context. Formally, we define expected regret as

\[ E[\mathrm{loss}(A, V) - \mathrm{OPT}(A, V)] = E[\mathrm{loss}(A, V)] - E\left[\min_{x \in S} \sum_{t=1}^{T} (c^t \cdot x)\right]. \tag{6.1} \]


Note that the E[OPT(A, V)] term corresponds to applying the min operator separately to each possible cost history to find the best fixed decision with respect to that particular cost history, and then taking the expectation with respect to these histories. Auer et al. [1995] give an alternative weaker definition of regret. We discuss relationships between the definitions in Appendix D.

¹ One can remove this requirement by guessing T, and doubling the guess each time we play longer than expected (see, for example, Theorem 6.4 from Auer et al. [2002]).

    Choose parameters $\gamma$ and $\epsilon$, where $\epsilon$ is a parameter of GEX
    t = 1
    Fix a basis $B = \{b_1, \ldots, b_n\} \subseteq S$
    while playing do
        Let $\chi^t = 1$ with probability $\gamma$ and $\chi^t = 0$ otherwise
        if $\chi^t = 0$ then
            Select $x^t$ from the distribution GEX($\hat{c}^1, \ldots, \hat{c}^{t-1}$)
            Incur cost $z^t = c^t \cdot x^t$
            $\hat{c}^t = 0 \in \mathbb{R}^n$
        else
            Draw j uniformly at random from {1, ..., n}
            $x^t = b_j$
            Incur cost and observe $z^t = c^t \cdot x^t$
            Define $\hat{\ell}^t$ by $\hat{\ell}^t_i = 0$ for $i \ne j$ and $\hat{\ell}^t_j = (n/\gamma) z^t$
            $\hat{c}^t = (B^\dagger)^{-1} \hat{\ell}^t$
        end if
        t = t + 1
    end while

Figure 6.1: The bandit-style geometric decision algorithm against an adaptive adversary (BGA).
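For a single realized history, the quantity inside the expectation in Equation (6.1) can be computed directly. The sketch below does this for a finite decision set; the helper names dot and realized_regret and the toy data are assumptions of this example, and the expected regret would be the average of this quantity over the randomness of the algorithm.

```python
def dot(c, x):
    return sum(ci * xi for ci, xi in zip(c, x))

def realized_regret(S, plays, costs):
    """loss(h^T, k^T) - OPT(k^T) for one realized history (finite S only)."""
    loss = sum(dot(c, x) for c, x in zip(costs, plays))
    total = [sum(col) for col in zip(*costs)]   # the summed cost vector c^{1:T}
    opt = min(dot(total, x) for x in S)         # best fixed decision in hindsight
    return loss - opt

S = [(1, 0), (0, 1)]
costs = [(1, 0), (0, 1), (1, 0)]
plays = [(1, 0), (0, 1), (1, 0)]                # we are charged on every round
regret = realized_regret(S, plays, costs)
```

Here the total loss is 3, while always playing (0, 1) would have cost 1, so the realized regret is 2.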


6.3  Algorithm 

We introduce an algorithm we call BGA, standing for Bandit-style Geometric decision algorithm against an Adaptive adversary. The algorithm alternates between playing decisions from a fixed basis to get unbiased estimates of costs, and playing (hopefully) good decisions based on those estimates. In order to determine the good decisions to play, it uses some online geometric optimization algorithm for the full-observation problem. We denote this algorithm by GEX (Geometric Experts algorithm). The implementation of GEX we analyze is based on the FPL algorithm of Kalai and Vempala [2003]; we detail this implementation and analysis in Appendix C. However, other algorithms could be used, for example the algorithm of Zinkevich [2003] when S is convex. We view GEX as a function from the sequence of previous cost vectors $(\hat{c}^1, \ldots, \hat{c}^{t-1})$ to distributions over decisions.

Pseudocode for our algorithm is given in Figure (6.1). On each timestep, we make decision $x^t$. With probability $(1 - \gamma)$, BGA plays a recommendation $x^t = \hat{x}^t \in S$ from GEX. With probability $\gamma$, we ignore $\hat{x}^t$ and play a basis decision, $x^t = b_i$, chosen uniformly at random from a sampling basis $B = \{b_1, \ldots, b_n\}$. The indicator variable $\chi^t$ is 1 on exploration iterations and 0 otherwise.

Our sampling basis B is an n × n matrix with columns $b_i \in S$, so we can write $x = Bw$ for any $x \in \mathbb{R}^n$ and weights $w \in \mathbb{R}^n$. For a given cost vector c, let $\ell = B^\dagger c$ (the superscript † indicates transpose). This is the vector of decision costs for the basis decisions, so $\ell^t_i = c^t \cdot b_i$. We define $\hat{\ell}^t$, an estimate of $\ell^t$, as follows: Let $\hat{\ell}^t = 0 \in \mathbb{R}^n$ on exploitation iterations. If on an exploration iteration we play $b_j$, then $\hat{\ell}^t$ is the vector where $\hat{\ell}^t_i = 0$ for $i \ne j$ and $\hat{\ell}^t_j = \frac{n}{\gamma}(c^t \cdot b_j)$. Note that $c^t \cdot b_j$ is the observed quantity, the cost of basis decision $b_j$. On each iteration, we estimate $c^t$ by $\hat{c}^t = (B^\dagger)^{-1} \hat{\ell}^t$. It is straightforward to show that $\hat{\ell}^t$ is an unbiased estimate of basis decision costs and that $\hat{c}^t$ is an unbiased estimate of $c^t$ on each timestep t.
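The unbiasedness claim for the basis-cost estimate can be checked by enumerating the outcomes of a single timestep. The sketch below computes the exact expectation of the estimate for a fixed cost vector; the function name expected_l_hat and the numerical values are assumptions of this example. Since the cost-vector estimate is a fixed linear map applied to the basis-cost estimate, its unbiasedness follows by linearity of expectation.

```python
def expected_l_hat(basis, c, gamma):
    """Exact expectation of the basis-cost estimate for one timestep.

    basis is the list of basis decisions b_1..b_n and c the true cost vector.
    """
    n = len(basis)
    mean = [0.0] * n               # exploitation (prob. 1 - gamma) contributes 0
    for j, b in enumerate(basis):  # exploring b_j has probability gamma / n
        observed = sum(ci * bi for ci, bi in zip(c, b))      # c . b_j
        mean[j] += (gamma / n) * (n / gamma) * observed      # scaled by n / gamma
    return mean

basis = [(1.0, 0.0), (1.0, 2.0)]
c = (0.3, -0.7)
true_l = [sum(ci * bi for ci, bi in zip(c, b)) for b in basis]   # entries c . b_j
mean_l_hat = expected_l_hat(basis, c, gamma=0.1)
max_gap = max(abs(a - b) for a, b in zip(mean_l_hat, true_l))
```

The scaling by n/γ exactly cancels the γ/n probability of sampling each basis element, which is what makes the estimate unbiased, at the price of a variance that grows as γ shrinks.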

The choice of the sampling basis plays an important role in the analysis of our algorithm. In particular, we use a barycentric spanner, introduced in [Awerbuch and Kleinberg, 2004]. A barycentric spanner $B = \{b_1, \ldots, b_n\}$ is a basis for S such that $b_i \in S$ and for all $x \in S$ we can write $x = Bw$ with coefficients $w_i \in [-1, 1]$. It may not be easy to find exact barycentric spanners in all cases, but Awerbuch and Kleinberg [2004] prove they always exist and give an algorithm for finding 2-approximate barycentric spanners (where the weights $w_i \in [-2, 2]$), which is sufficient for our purposes.
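The spanner property is easy to check for a small, explicitly listed decision set. The sketch below is only a verifier, not the construction of Awerbuch and Kleinberg: it solves Bw = x exactly for each x in S and tests the coefficient bound. The names solve and is_spanner, the parameter C, and the example set are assumptions of this illustration, and the code assumes B is nonsingular.

```python
from fractions import Fraction

def solve(B_cols, x):
    """Solve B w = x by Gauss-Jordan elimination in exact rational arithmetic.
    B_cols lists the columns b_1..b_n of B; B is assumed nonsingular."""
    n = len(x)
    rows = [[Fraction(B_cols[j][i]) for j in range(n)] + [Fraction(x[i])]
            for i in range(n)]                      # augmented matrix [B | x]
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                f = rows[r][col]
                rows[r] = [a - f * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]

def is_spanner(B_cols, S, C=1):
    """True if every x in the finite set S equals B w with all |w_i| <= C."""
    return all(abs(w) <= C for x in S for w in solve(B_cols, x))

S = [(1, 0), (0, 1), (1, 1), (2, 1)]
standard = [(1, 0), (0, 1)]      # needs weight 2 to reach (2, 1)
better = [(2, 1), (0, 1)]        # every point of S has weights in [-1, 1]
```

In this toy set the standard basis is only a 2-approximate spanner, while the basis {(2, 1), (0, 1)} is an exact one.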


6.4  Analysis 

6.4.1  Preliminaries 

At each time step, BGA either (with probability $1 - \gamma$) plays the recommendation $\hat{x}^t$ from GEX, or else (with probability $\gamma$) plays a random basis vector from B. For purposes of analysis, however, it will be convenient to imagine that we request a recommendation $\hat{x}^t$ from GEX on every iteration, and also that we randomly pick a basis element to explore, $b^t \in \{b_1, \ldots, b_n\}$, on each iteration. We then decide to play either $\hat{x}^t$ or $b^t$ based on the outcome $\chi^t$ of a coin of bias $\gamma$. Thus, the complete history of the algorithm is specified by the algorithm history $G^{t-1} = [\chi^1, \hat{x}^1, b^1, \chi^2, \hat{x}^2, b^2, \ldots, \chi^{t-1}, \hat{x}^{t-1}, b^{t-1}]$, which encodes all previous random choices. The sample space for all probabilities and expectations is the set of all possible algorithm histories of length T. Thus, for a given adversary V, the various random variables and vectors we consider, such as $x^t$, $c^t$, $\hat{c}^t$, $\hat{x}^t$, and others, can all be viewed as functions on the set of possible algorithm histories. Unless otherwise stated, our expectations and probabilities are with respect to the distribution over these histories.

A partial history $G^{t-1}$ can be viewed as a subset of the sample space (an event) consisting of all complete histories that have $G^{t-1}$ as a prefix. We frequently consider conditional distributions and corresponding expectations with respect to partial algorithm histories. For instance, if we condition on a history $G^{t-1}$, the random variables $c^1, \ldots, c^t$, $\ell^1, \ldots, \ell^t$, $\hat{\ell}^1, \ldots, \hat{\ell}^{t-1}$, $\hat{c}^1, \ldots, \hat{c}^{t-1}$, $x^1, \ldots, x^{t-1}$, and $\hat{x}^1, \ldots, \hat{x}^{t-1}$ are fully determined.

    Symbol                                  Meaning
    $S \subset \mathbb{R}^n$                set of decisions, a compact subset of $\mathbb{R}^n$
    $D \in \mathbb{R}$                      $L_1$ bound on diameter of S, $\forall x, y \in S$, $\|x - y\|_1 \le D$
    $n \in \mathbb{N}$                      dimension of decision space
    $h^t$                                   decision history, $h^t = x^1, \ldots, x^t$
    $H^*$                                   set of possible decision histories
    $V : H^* \to \mathbb{R}^n$              adversary, function from decision histories to cost vectors
    A                                       an online optimization algorithm
    $G^{t-1}$                               history of BGA randomness for timesteps 1 through t − 1
    $c^t \in \mathbb{R}^n$                  cost vector on time t
    $\hat{c}^t \in \mathbb{R}^n$            BGA’s estimate of the cost vector on time t
    $M \in \mathbb{R}^+$                    bound on single-iteration cost, $|c^t \cdot x^t| \le M$
    $B \subseteq S$                         sampling basis $B = \{b_1, \ldots, b_n\}$
    $\beta_\infty \in \mathbb{R}$           matrix max norm on $(B^\dagger)^{-1}$
    $\ell^t \in [-M, M]^n$                  vector, $\ell^t_i = c^t \cdot b_i$ for $b_i \in B$
    $\hat{\ell}^t \in \mathbb{R}^n$         BGA’s estimate of $\ell^t$
    $T \in \mathbb{N}$                      end of time, index of final iteration
    $x^t \in S$                             BGA’s decision on time t
    $\hat{x}^t \in S$                       decision recommended by GEX on time t
    $\chi^t \in \{0, 1\}$                   indicator, $\chi^t = 1$ if BGA explores on t, 0 otherwise
    $\gamma \in [0, 1]$                     the probability BGA explores on each timestep
    $z^t \in [-M, M]$                       BGA’s loss on iteration t, $z^t = c^t \cdot x^t$
    $\hat{z}^t \in [-R, R]$                 loss of GEX, $\hat{z}^t = \hat{c}^t \cdot \hat{x}^t$

Table 6.1: Summary of notation used in the chapter.

We now outline the general structure of our argument. Let $\hat{z}^t = \hat{c}^t \cdot \hat{x}^t$ be the loss perceived by GEX on iteration t. In keeping with earlier definitions, $\mathrm{loss(BGA)} = z^{1:T}$ and $\mathrm{loss(GEX)} = \hat{z}^{1:T}$. We also let $\mathrm{OPT} = \mathrm{OPT}(\mathrm{BGA}, V) = c^{1:T} \cdot R(c^{1:T})$ be the performance of the best post-hoc decision, and similarly $\widehat{\mathrm{OPT}} = \mathrm{OPT}(\hat{c}^1, \ldots, \hat{c}^T) = \hat{c}^{1:T} \cdot R(\hat{c}^{1:T})$.

The base of our analysis is a bound on the loss of GEX with respect to the cost vectors $\hat{c}^t$ of the form

\[ E[\mathrm{loss(GEX)}] \le E[\widehat{\mathrm{OPT}}] + (\text{terms}). \tag{6.2} \]

Such a result is given in Appendix C, and follows from an adaptation of the analysis of Kalai and Vempala [2003]. We then prove statements having the general form

\[ E[\mathrm{loss(BGA)}] \le E[\mathrm{loss(GEX)}] + (\text{terms}) \tag{6.3} \]

and

\[ E[\widehat{\mathrm{OPT}}] \le E[\mathrm{OPT}] + (\text{terms}). \tag{6.4} \]

These statements connect our real loss to the “imaginary” loss of GEX, and similarly connect the loss of the best decision in GEX’s imagined world with the loss of the best decision in the real world. Combining the results corresponding to Equations (6.2), (6.3), and (6.4) leads to an overall bound on the regret of BGA.

6.4.2  High  Probability  Bounds  on  Estimates 

We prove a bound on the accuracy of BGA’s estimates $\hat{\ell}^t$, and use this to show a relationship between OPT and $\widehat{\mathrm{OPT}}$ of the form in Equation (6.4).

Define random variables $e^0 = 0$ and $e^t = \ell^t - \hat{\ell}^t$. We are really interested in the corresponding sums $e^{1:t}$, where $e_j^{1:t}$ is the total error in our estimate of $c^{1:t} \cdot b_j$. We now bound $|e_j^{1:t}|$.

Theorem  6.4.1.  For  A  >  0, 

Pr  >  A  — Vtl  < 

.  7  . 

Proof. It is sufficient to show that the sequence $e_j^0, e_j^{1:1}, e_j^{1:2}, \ldots, e_j^{1:T}$ of random
variables is a bounded martingale sequence with respect to the filter $G^0, G^1, \ldots, G^T$; that is,
$E[e_j^{1:t} \mid G^{t-1}] = e_j^{1:t-1}$. The result then follows from Azuma’s Inequality (see, for
example, Motwani and Raghavan [1995]).

First, observe that $e_j^{1:t} = \ell_j^t - \hat{\ell}_j^t + e_j^{1:t-1}$. Further, the cost vector $c^t$ is determined if
we know $G^{t-1}$, and so $\ell_j^t$ is also fixed. Thus, accounting for the $\gamma/n$ probability we explore
a particular basis decision $b_j$, we have

$$E[e_j^{1:t} \mid G^{t-1}] = \frac{\gamma}{n}\left[\ell_j^t - \frac{n}{\gamma}\ell_j^t + e_j^{1:t-1}\right] + \left(1 - \frac{\gamma}{n}\right)\left[\ell_j^t - 0 + e_j^{1:t-1}\right] = e_j^{1:t-1},$$

and so we conclude that the $e_j^{1:t}$ form a martingale sequence. Notice that
$|e_j^{1:t} - e_j^{1:t-1}| = |\ell_j^t - \hat{\ell}_j^t|$. If we don’t sample, $\hat{\ell}_j^t = 0$ and so $|e_j^{1:t} - e_j^{1:t-1}| \le M$.
If we do sample, we have $\hat{\ell}_j^t = \frac{n}{\gamma}\ell_j^t$ and so $|e_j^{1:t} - e_j^{1:t-1}| \le \frac{n}{\gamma}M$. This bound is
worse, so it holds in both cases. The result now follows from Azuma’s inequality. □
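
The martingale step can be checked numerically. The following is a minimal sketch, assuming the sampling scheme used in the proof (with probability $\gamma/n$ the estimate is $(n/\gamma)\ell_j^t$, otherwise it is 0); the function names and the numeric values are illustrative and not part of the thesis.

```python
from fractions import Fraction
import math

def expected_increment(ell, n, gamma):
    """Exact E[ell - ell_hat] for one coordinate j, given the history.

    With probability gamma/n we explore b_j and set ell_hat = (n/gamma) * ell;
    otherwise ell_hat = 0.
    """
    p = gamma / n
    explore = ell - (n / gamma) * ell
    no_explore = ell - 0
    return p * explore + (1 - p) * no_explore

def max_increment(M, n, gamma):
    """Largest possible |ell - ell_hat| when |ell| <= M (assumes gamma <= n)."""
    return max(M, (n / gamma - 1) * M)

def azuma_tail(lam):
    """Right-hand side of Theorem 6.4.1: 2 * exp(-lam^2 / 2)."""
    return 2 * math.exp(-lam ** 2 / 2)

# Illustrative values: n = 3 basis decisions, gamma = 1/4, ell = 7/10.
print(expected_increment(Fraction(7, 10), 3, Fraction(1, 4)))
```

The expected increment is zero for any choice of `ell`, which is exactly the martingale property, and the increment never exceeds $nM/\gamma$, the constant that enters Azuma’s inequality.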


Let $\beta_\infty = \|(B^\dagger)^{-1}\|_\infty$, a matrix $L_\infty$-norm on $(B^\dagger)^{-1}$, so that for any $w$,
$\|(B^\dagger)^{-1} w\|_\infty \le \beta_\infty \|w\|_\infty$.


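Corollary 6.4.2. For $\delta \in (0, 1]$, and all $t$ from 1 to $T$,

$$\Pr\left[\, \|\hat{c}^{1:t} - c^{1:t}\|_\infty \ge \beta_\infty J(\delta, \gamma) \sqrt{t} \,\right] \le \delta,$$

where $J(\delta, \gamma) = \frac{nM}{\gamma}\sqrt{2\ln(2n/\delta)}$.

Proof. Solving $\delta/n = 2e^{-\lambda^2/2}$ yields $\lambda = \sqrt{2\ln(2n/\delta)}$, and then using this value in
Theorem 6.4.1 gives

$$\Pr\left[\, |e_i^{1:t}| \ge J(\delta, \gamma)\sqrt{t} \,\right] \le \delta/n$$

for all $i \in \{1, 2, \ldots, n\}$. Then,

$$\Pr\left[\, \|e^{1:t}\|_\infty \ge J(\delta, \gamma)\sqrt{t} \,\right] \le \sum_{i=1}^{n} \Pr\left[\, |e_i^{1:t}| \ge J(\delta, \gamma)\sqrt{t} \,\right] \le \delta$$

by the union bound. Now, notice that we can relate $\hat{c}^{1:t}$ and $\hat{\ell}^{1:t}$ by

$$\hat{c}^{1:t} = \sum_{\tau=1}^{t} \hat{c}^\tau = \sum_{\tau=1}^{t} (B^\dagger)^{-1}\hat{\ell}^\tau = (B^\dagger)^{-1}\sum_{\tau=1}^{t} \hat{\ell}^\tau,$$

and similarly for $c^{1:t}$ and $\ell^{1:t}$. Then

$$\begin{aligned}
\Pr\left[\, \|\hat{c}^{1:t} - c^{1:t}\|_\infty \ge \beta_\infty J(\delta, \gamma)\sqrt{t} \,\right]
&= \Pr\left[\, \|(B^\dagger)^{-1}(\hat{\ell}^{1:t} - \ell^{1:t})\|_\infty \ge \beta_\infty J(\delta, \gamma)\sqrt{t} \,\right] \\
&\le \Pr\left[\, \beta_\infty \|e^{1:t}\|_\infty \ge \beta_\infty J(\delta, \gamma)\sqrt{t} \,\right] \\
&= \Pr\left[\, \|e^{1:t}\|_\infty \ge J(\delta, \gamma)\sqrt{t} \,\right] \\
&\le \delta.
\end{aligned}$$

□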
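
To make the choice of $\lambda$ in the proof concrete, here is a small sketch that solves $\delta/n = 2e^{-\lambda^2/2}$ and evaluates $J(\delta, \gamma)$, assuming $J(\delta, \gamma) = (nM/\gamma)\sqrt{2\ln(2n/\delta)}$ as used in the proof above. The helper names and sample values are hypothetical.

```python
import math

def lam_for(delta, n):
    """Solve delta / n = 2 * exp(-lam**2 / 2) for lam."""
    return math.sqrt(2 * math.log(2 * n / delta))

def J(delta, gamma, n, M):
    """J(delta, gamma) = (n M / gamma) * sqrt(2 ln(2n / delta))."""
    return (n * M / gamma) * lam_for(delta, n)

# Illustrative values only: n = 4, delta = 0.05, gamma = 0.1, M = 1.
lam = lam_for(0.05, 4)
per_coordinate_tail = 2 * math.exp(-lam ** 2 / 2)   # equals delta / n
union_bound = 4 * per_coordinate_tail               # equals delta
```

Summing the per-coordinate tail over the $n$ coordinates recovers $\delta$, which is the union-bound step of the corollary.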

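We can now prove our main result for the section, a statement of the form of Equation (6.4)
relating OPT and $\widehat{\text{OPT}}$:

Theorem 6.4.3. If we play $\mathcal{V}$ against BGA for $T$ timesteps,

$$E[\widehat{\text{OPT}}] \le E[\text{OPT}] + (1 - \delta)\left(\tfrac{3}{2} D \beta_\infty J(\delta, \gamma)\sqrt{T}\right) + \delta M T.$$

Proof. Let $\Phi = \hat{c}^{1:T} - c^{1:T}$. By definition of $\mathcal{R}$, $\mathcal{R}(\hat{c}^{1:T}) \cdot \hat{c}^{1:T} \le \mathcal{R}(c^{1:T}) \cdot \hat{c}^{1:T}$, or
equivalently $\mathcal{R}(c^{1:T} + \Phi) \cdot (c^{1:T} + \Phi) \le \mathcal{R}(c^{1:T}) \cdot (c^{1:T} + \Phi)$, and so by expanding and
rearranging we have

$$\mathcal{R}(c^{1:T} + \Phi) \cdot c^{1:T} - \mathcal{R}(c^{1:T}) \cdot c^{1:T} \le \left(\mathcal{R}(c^{1:T}) - \mathcal{R}(c^{1:T} + \Phi)\right) \cdot \Phi \le D\|\Phi\|_\infty. \tag{6.5}$$

Then,

$$\begin{aligned}
|\text{OPT} - \widehat{\text{OPT}}| &= \left|\mathcal{R}(c^{1:T}) \cdot c^{1:T} - \mathcal{R}(c^{1:T} + \Phi) \cdot (c^{1:T} + \Phi)\right| \\
&\le \left|\left(\mathcal{R}(c^{1:T}) - \mathcal{R}(c^{1:T} + \Phi)\right) \cdot c^{1:T}\right| + \left|\mathcal{R}(c^{1:T} + \Phi) \cdot \Phi\right| \\
&\le (D + D/2)\|\Phi\|_\infty,
\end{aligned}$$

where we have used Equation (6.5). Recall from Section 6.2 that we assume $\|x\|_1 \le D/2$ for
all $x \in S$, so $\|x - y\|_1 \le D$ for all $x, y \in S$. The theorem follows by applying the bound
on $\Phi$ given by Corollary 6.4.2, and then observing that the above relationship holds for
at least a $1 - \delta$ fraction of the possible algorithm histories. For the other $\delta$ fraction, the
difference might be as much as $\delta M T$. Writing the overall expectation as the sum of two
expectations conditioned on whether or not the bound holds gives the result. □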
6.4.3 Relating the Loss of BGA and its GEX Subroutine

Now we prove a statement like Equation (6.3), relating loss(BGA) to loss(GEX).

Theorem 6.4.4. If we run BGA with parameter $\gamma$ against $\mathcal{V}$ for $T$ timesteps,

$$E[\text{loss(BGA)}] \le (1 - \gamma) E[\text{loss(GEX)}] + \gamma M T.$$

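Proof. For a given adversary $\mathcal{V}$, $G^{t-1}$ fully determines the sequence of cost vectors given
to algorithm GEX. So, we can view GEX as a function from $G^{t-1}$ to probability distributions
over $S$. If we present a cost vector $\hat{c}$ to GEX, then the expected cost to GEX given
history $G^{t-1}$ is $\sum_{\hat{x} \in S} \Pr(\hat{x} \mid G^{t-1}) (\hat{c} \cdot \hat{x})$. If we define $\bar{x}^t = \sum_{\hat{x} \in S} \Pr(\hat{x} \mid G^{t-1})\, \hat{x}$, we
can re-write the expected loss of GEX against $\hat{c}$ as $\hat{c} \cdot \bar{x}^t$; that is, we can view GEX as
incurring the cost of some convex combination of the possible decisions in expectation.

Let $\hat{\ell}^t_{[i]}$ be $\hat{\ell}^t$ given that we explore by playing basis vector $b_i$ on time $t$, and similarly let
$\hat{c}^t_{[i]} = (B^\dagger)^{-1}\hat{\ell}^t_{[i]}$. Observe that the $j$th component of $\hat{\ell}^t_{[i]}$ is $\frac{n}{\gamma}\ell_j^t$ for $j = i$ and 0 otherwise, and so

$$\sum_{i=1}^{n} \hat{c}^t_{[i]} = \frac{n}{\gamma}(B^\dagger)^{-1}\ell^t = \frac{n}{\gamma} c^t. \tag{6.6}$$

Now, we can write

$$\begin{aligned}
E[\hat{z}^t \mid G^{t-1}] &= (1 - \gamma) \cdot 0 + \gamma \sum_{i=1}^{n} \frac{1}{n} \sum_{\hat{x}^t \in S} \Pr(\hat{x}^t \mid G^{t-1})\, \hat{c}^t_{[i]} \cdot \hat{x}^t \\
&= \frac{\gamma}{n}\left[\sum_{i=1}^{n} \hat{c}^t_{[i]}\right] \cdot \bar{x}^t \\
&= c^t \cdot \bar{x}^t,
\end{aligned}$$

where the last step uses Equation (6.6). Now, we consider the conditional expectation of $z^t$
and see that

$$\begin{aligned}
E[z^t \mid G^{t-1}] &= (1 - \gamma)(c^t \cdot \bar{x}^t) + \gamma \sum_{i=1}^{n} \frac{1}{n}(c^t \cdot b_i) \\
&\le (1 - \gamma) E[\hat{z}^t \mid G^{t-1}] + \gamma M.
\end{aligned} \tag{6.7}$$

Then we have

$$\begin{aligned}
E[z^t] &= E\left[E[z^t \mid G^{t-1}]\right] \\
&\le E\left[(1 - \gamma) E[\hat{z}^t \mid G^{t-1}] + \gamma M\right] \\
&= (1 - \gamma) E\left[E[\hat{z}^t \mid G^{t-1}]\right] + \gamma M \\
&= (1 - \gamma) E[\hat{z}^t] + \gamma M,
\end{aligned} \tag{6.8}$$

by using the inequality from Equation (6.7). The theorem follows by summing the
inequality (6.8) over $t$ from 1 to $T$ and applying linearity of expectation. □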

6.4.4  A  Bound  on  the  Expected  Regret  of  BGA 

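Theorem 6.4.5. If we run BGA with parameter $\gamma$ using subroutine GEX with parameter $\epsilon$
(as defined in Appendix C), then for all $\delta \in (0, 1]$,

$$E[\text{loss(BGA)}] \le E[\text{OPT}] + O\left(\frac{D \beta_\infty n M \sqrt{2\ln(2n/\delta)}\sqrt{T}}{\gamma} + \delta M T + \frac{\epsilon n^3 M^2 T}{\gamma^2} + \frac{n}{\epsilon} + \gamma M T\right).$$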


Proof. In Appendix C, we show an algorithm to plug in for GEX, based on the FPL
algorithm of Kalai and Vempala [2003], and give bounds on regret against a deterministic
adaptive adversary. We first show how to apply that analysis to GEX running as a
subroutine to BGA.

First, we need to bound $|\hat{c}^t \cdot x|$. By definition, for any $x \in S$, we can write $x = Bw$
for weights $w$ with $w_i \in [-1, 1]$ (or $[-2, 2]$ if $B$ is an approximate barycentric spanner),
so in either case $w_i \in [-2, 2]$. Note also that $\|\hat{\ell}^t\|_1 \le \frac{n}{\gamma} M$. Thus,

$$|\hat{c}^t \cdot x| = |(B^\dagger)^{-1}\hat{\ell}^t \cdot Bw| = |(\hat{\ell}^t)^\dagger B^{-1} B w| = |\hat{\ell}^t \cdot w| \le \|\hat{\ell}^t\|_1 \|w\|_\infty \le \frac{2nM}{\gamma}.$$

Let $R = 2nM/\gamma$. Suppose at the beginning of time we fix the random decisions of BGA
that are not made by GEX; that is, we fix a sequence $X$ recording, for each timestep,
whether BGA explores and which basis decision it then plays. Fixing this randomness
together with $\mathcal{V}$ determines a new deterministic adaptive adversary $\hat{\mathcal{V}}$
that GEX is effectively playing against. To see this, let $\hat{h}^{t-1} = [\hat{x}^1, \ldots, \hat{x}^{t-1}]$. If we
combine $\hat{h}^{t-1}$ with the information in $X$, it fully determines a partial history $G^{t-1}$. If we
let $h^{t-1} = [x^1, \ldots, x^{t-1}]$ be the partial decision history that can be recovered from $G^{t-1}$,
then $\hat{\mathcal{V}}(\hat{h}^{t-1})$ is the estimate $\hat{c}^t$ that BGA computes from $c^t = \mathcal{V}(h^{t-1})$ and $X$. Thus,
when GEX is run as a subroutine of BGA, we can apply Lemma C.0.4 from the Appendix and
conclude

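$$E[\text{loss(GEX)} \mid X] \le E[\widehat{\text{OPT}} \mid X] + \epsilon(4n + 2)R^2 T + \frac{4n}{\epsilon}. \tag{6.9}$$

For the remainder of this proof, we use big-Oh notation to simplify the presentation. Now,
taking the expectation of both sides of Equation (6.9),

$$E[\text{loss(GEX)}] \le E[\widehat{\text{OPT}}] + O\left(\epsilon n R^2 T + \frac{n}{\epsilon}\right).$$

Applying Theorem 6.4.4,

$$E[\text{loss(BGA)}] \le (1 - \gamma) E[\widehat{\text{OPT}}] + O\left(\epsilon n R^2 T + \frac{n}{\epsilon} + \gamma M T\right),$$

and then using Theorem 6.4.3 we have

$$\begin{aligned}
E[\text{loss(BGA)}] &\le (1 - \gamma) E[\text{OPT}] + O\left((1 - \delta) D \beta_\infty J(\delta, \gamma)\sqrt{T} + \delta M T + \epsilon n R^2 T + \frac{n}{\epsilon} + \gamma M T\right) \\
&\le E[\text{OPT}] + O\left(\frac{D \beta_\infty n M \sqrt{2\ln(2n/\delta)}\sqrt{T}}{\gamma} + \delta M T + \frac{\epsilon n^3 M^2 T}{\gamma^2} + \frac{n}{\epsilon} + \gamma M T\right).
\end{aligned}$$

For the last line, note that while $E[\text{OPT}]$ could be negative, it is still bounded by $MT$,
and so this just adds another $\gamma M T$ term, which is captured in the big-Oh term. □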

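Ignoring the dependence on $n$, $M$, and $D$ and simplifying, we see BGA’s expected
regret is bounded by

$$E[\text{regret(BGA)}] = O\left(\frac{\sqrt{T \ln(1/\delta)}}{\gamma} + \delta T + \frac{\epsilon T}{\gamma^2} + \frac{1}{\epsilon} + \gamma T\right).$$

Setting $\gamma = \delta = T^{-1/4}$ and $\epsilon = T^{-3/4}$, we get a bound on our loss of order $O(T^{3/4}\sqrt{\ln T})$.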
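
The balancing of terms can be illustrated numerically. The sketch below assumes the simplified bound has the five terms $\sqrt{T\ln(1/\delta)}/\gamma$, $\delta T$, $\epsilon T/\gamma^2$, $1/\epsilon$, and $\gamma T$ (what Theorem 6.4.5 gives after dropping $n$, $M$, $D$, and constants), and assumes the parameter choice $\gamma = \delta = T^{-1/4}$, $\epsilon = T^{-3/4}$. The function name and the value of $T$ are illustrative.

```python
import math

def regret_terms(T, gamma, delta, eps):
    """The five terms of the simplified bound, ignoring n, M, D and constants."""
    return [
        math.sqrt(T * math.log(1 / delta)) / gamma,
        delta * T,
        eps * T / gamma ** 2,
        1 / eps,
        gamma * T,
    ]

T = 10 ** 8
terms = regret_terms(T, gamma=T ** -0.25, delta=T ** -0.25, eps=T ** -0.75)
target = T ** 0.75 * math.sqrt(math.log(T))   # the claimed order of the bound
```

Under these assumptions the last four terms each equal $T^{3/4}$, and the first is $\tfrac{1}{2} T^{3/4}\sqrt{\ln T}$, so every term is within the claimed order.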

6.5  Conclusions  and  Later  Work 

We have presented a general algorithm for online optimization over an arbitrary set of
decisions $S$ and proved regret bounds for our algorithm that hold against an adaptive
adversary. A number of questions are raised by this work. In the “flat” bandits
problem, bounds of the form $O(\sqrt{T})$ are possible against an adaptive adversary [Auer et al.,
2002]. Against an oblivious adversary in the geometric case, a bound of $O(T^{2/3})$ is achieved
by Awerbuch and Kleinberg [2004]. We achieve a bound of $O(T^{3/4}\sqrt{\ln T})$ for this
problem against an adaptive adversary. Auer et al. [2002] give lower bounds showing that the
$O(\sqrt{T})$ result is tight, but no such bounds are known for the geometric decision-space
problem. Can our bounds be improved, and what is the corresponding lower bound for the
problem?

After the publication of the work described here, these questions were answered by
Dani and Hayes [2006]. They show that a tighter analysis of the algorithm presented here
in fact yields a bound of $O(T^{2/3})$ on regret, and they show a corresponding lower bound.

A related issue is the use of information received by the algorithm; our algorithm and
the algorithm of Awerbuch and Kleinberg [2004] only use a $\gamma$ fraction of the feedback they
receive, which is intuitively unappealing. Further, the lower bound of Dani and Hayes
[2006] depends on the assumption that the information from the $1 - \gamma$ fraction of exploit
rounds is discarded. This leaves open the possibility that an algorithm that uses all of the
feedback could achieve lower regret.
Chapter 7
Conclusions 


7.1  Summary  of  Contributions 

This  thesis  makes  the  following  principal  contributions;  taken  together,  they  provide  a 
powerful  set  of  modeling  and  algorithmic  tools  for  creating  robust  plans  of  action  for 
uncertain  environments. 

• Fast algorithms for MDP planning: The improved prioritized sweeping (IPS)
algorithm generalizes Dijkstra’s algorithm, and is fast on problems that are “almost”
deterministic. The prioritized policy iteration algorithm combines the intuition of
IPS with fast policy evaluation using linear solvers, yielding good all-around
performance; it is especially effective on problems with a great deal of cycling or on
problems that are “almost” policy evaluation problems. For problems with a fixed start
state, the bounded real-time dynamic programming (BRTDP) algorithm improves
over RTDP by providing stationary policies with provable performance guarantees.
BRTDP also offers better convergence properties than many other algorithms for
this problem such as HDP and LRTDP.

•  The  MDP  with  adversarial  costs  formulation:  We  introduced  a  generalization  of 
standard  MDP  planning  that  considers  a  set  of  potential  cost  vectors,  from  which  an 
adversary  selects  one,  rather  than  a  fixed  known  cost  vector.  This  formulation  can 
be  used  to  model  a  variety  of  interesting  problems,  including  a  sensor-placement  / 
observation-avoidance  game. 

• New uses for the convex game framework: We show that optimal oblivious
routing as well as the above adversarial MDP problem can be modeled as bilinear-payoff
zero-sum convex games, and show how convex games can be used to extend the
stochastic game framework to handle periods of partial observability. Combined
with the known result that zero-sum extensive-form games can be represented as
convex games, these results establish convex games as a useful modeling
framework, and highlight the importance of finding fast algorithms for problems in this
class.

• Fast algorithms for convex games: We introduced the single oracle algorithm and
two versions of the double oracle algorithm, and experimentally demonstrated their
effectiveness on a variety of convex games. These algorithms dramatically
outperform approaches based on directly solving the linear programs for the games.
Fictitious play, a very simple oracle-based algorithm, had remarkably good performance
on one of the problems, Rhode Island hold’em.

• A limited-observation geometric no-regret algorithm: The bandit-style geometric
decision algorithm (BGA) provides no-regret guarantees given complex structured
action sets and a limited total-cost observation model, even when facing an
adaptive adversary. This algorithm can be used to guarantee good performance when
playing a repeated convex game.


7.2  Summary  of  Open  Questions  and  Future  Work 

In the preceding chapters we have highlighted a variety of promising extensions to the work
presented here. In this section we summarize some of those possibilities for future work,
and also state several open questions and general themes.

• Convex games: Chapter 3 presented a variety of examples of convex games, and
Chapters 5 and 6 presented practical, efficient algorithms for planning in convex
games in the off-line and on-line settings, respectively. We feel that there are
potentially many other interesting problems that can be cast as convex games, yielding
immediate algorithmic and theoretical results.

• Extension to NP-hard response problems: For the convex games we have
considered, fast exact best-response oracles were available. However, the double oracle
algorithm approach holds great potential for solving problems when even the
best-response problem is NP-hard. For example, we might consider delivery problem
games where an approximation algorithm for the traveling salesman problem is used
as a best-response oracle.
• Improving the DOBA+ algorithm: Significant improvements to the DOBA+
algorithm may be possible, while still using the general double oracle
game-approximation technique. Future work includes developing efficient techniques
for finding “good” best responses (Section 5.4), developing a better understanding
of the connections to bundle algorithms for non-smooth optimization (see, for example,
Hiriart-Urruty and Lemarechal [1993]), and proving convergence guarantees
(perhaps through better aggregation/discarding strategies) that do not rely on
interpolation with fictitious play.

• Algorithms specialized for EFGs: In Chapter 5, we focused on developing
algorithms applicable to general convex games. However, extensive-form games have
significant structure that is perhaps not fully exploited even by considering them
as a convex game with very fast best response oracles. For example, rather than
a black-box best-response algorithm, the dynamic-programming algorithm for best
responses on an EFG allows efficient linear optimization over the full set of best
responses, as well as methods for sampling from or even enumerating the set. The
tree structure of EFGs also opens up the possibility of decomposition algorithms;
we have already done preliminary work on such an algorithm.

•  More  efficient  EFG  representations:  The  convex  extensive-form  game  model  of 
Chapter  4  shows  that  the  standard  EFG  representation  can  be  very  inefficient:  many 
EFGs  have  exponentially  more  compact  representations  as  CEFGs. 

The GameShrink algorithm of Gilpin and Sandholm [2006b] provides another
approach to creating more compact EFG representations: their algorithm can take a
special type of EFG and transform it to a potentially much smaller but
strategically equivalent EFG.

We have already begun a preliminary line of work that shows that the approach of
Gilpin and Sandholm [2006b] can be greatly extended. How much more is
possible? Clearly arbitrary POSGs cannot be represented as EFGs or CEFGs (barring a
complete collapse of the complexity hierarchy), but pushing on the representative
power of EFG-like game models seems to be an attractive avenue for scaling up
game-theoretic planning approaches.

• Extensive-form games and cost-paired MDP games: There appears to be a
close connection between extensive-form games and cost-paired MDP games. An
extensive-form game in sequence-weight representation can be specified by two
trees (the information set / sequence tree for each player) together with a matrix
that provides a linear mapping from strategies in one tree to edge-costs in the other,
and vice versa. Each tree has player nodes (corresponding to information sets) and
non-player nodes where the next state is chosen according to the uniform
distribution.¹ Under this interpretation, the sequence weights in the extensive-form game
become exactly the state-action visitation frequencies in a cost-paired MDP game.

This connection raises several interesting questions. First, the MDPs in cost-paired
MDP games can contain cycles, while the information set trees of an EFG are by
definition acyclic. In this way, cost-paired MDP games are actually more general
than EFGs. What does this generality imply when we interpret a cost-paired MDP
game as an EFG?

Second, Even-Dar et al. [2005] present an interesting algorithm for acting in an MDP
where costs can change. This is exactly what happens in a cost-paired MDP game
(or an EFG under this interpretation) when one player changes their policy. Thus, it
should be possible to adapt the theoretical guarantees of Even-Dar et al. [2005] to
EFGs where we imagine placing a no-regret (experts) algorithm at each information
set for each player. With suitable simulation in this game, it should be possible to
prove both players converge to a minimax optimal strategy.

As  this  summary  shows,  this  thesis  raises  more  questions  than  it  answers.  Perhaps  the 
only  certainty  is  that  the  problem  of  planning  in  uncertain  environments  remains  both 
challenging  and  important. 


¹This distribution can potentially also directly account for some actions of the random player, making it
non-uniform. In the usual sequence-form best response dynamic program, the “junction” nodes are thought
of as sum nodes instead of average nodes; however, it is simple to transform the problem between these two
interpretations.
Appendix A


The  Transition  Functions  of  a  CEFG 
Interpreted  as  Probabilities 


Lemma A.0.1. For any CEFG $G$ where some $f_p^{s,s'}(x_p) \notin [0, 1]$, there exists an equivalent
CEFG $G'$ where $f_p^{s,s'}(x_p) \in [0, 1]$.

Proof. Fix a particular problematic $s, s'$ in $G$; as these states are fixed, we drop the $s, s'$
superscripts from the $f$-functions. To prove the lemma, we define an $f$-equivalent $G'$
with $f$-functions denoted $g$, where $g_p \in [0, 1]$ for all $p$. As $G$ has a finite number of edges,
this transformation can be applied repeatedly to prove the lemma. Define

$$f(x) = \prod_p f_p(x_p).$$

Each player chooses $x_p \in X_p^s$ independently. Thus, if for some player $p$ there exist
$x, x' \in X_p^s$ such that $f_p(x) > 0$ and $f_p(x') < 0$, then fixing the other players’ actions, either
$x$ or $x'$ would make $\Pr(s' \mid s, x) = \prod_{p \in A(s)} f_p(x_p)$ negative, violating Equation (4.2).

Thus, $f_p(x_p)$ always has the same sign for all $x_p \in X_p^s$. Since $\Pr(s' \mid s)$ must be
non-negative, there must be an even number (possibly 0) of players where $f_p(x_p) < 0$. We
can simply switch the sign on all of these players’ $f$-functions, creating an $f$-equivalent
game where $f_p(x_p) \ge 0$ for all players.

Suppose for some players and action choices, $f_p(x_p) > 1$. For each player, define
$x_p^* = \arg\max_{x \in X_p^s} f_p(x)$, and let $\alpha_p = f_p(x_p^*)$. By assumption, $\beta = \prod_p \alpha_p \le 1$; this
is the probability of the $s$ to $s'$ transition when each player selects $x_p^*$. Now, for $G'$ define
$g_p = (1/\alpha_p) f_p$. Clearly $g_p(x) \le 1$, as we are dividing by the maximum value. However,
we are off by the constant $\beta$, as for any $x \in X^s$, $f(x) = \beta g(x)$. We can resolve this easily
enough, however, as $\beta \le 1$. We simply set $g_1 = (\beta/\alpha_1) f_1$ instead of $g_1 = (1/\alpha_1) f_1$, and
so $g_1(x_1) \in [0, \beta]$ instead of $g_1(x_1) \in [0, 1]$. We now see that for all $x$,

$$\prod_p f_p(x_p) = \prod_p g_p(x_p),$$

and so $G'$ is $f$-equivalent to $G$ and has $f$-functions for the $s \to s'$ transition that only take
on values in $[0, 1]$. □
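
The rescaling in the second half of the proof can be sketched in a few lines. This is an illustrative sketch only: it assumes the sign-flipping step has already been applied (all values non-negative) and represents each player’s $f$-function for a fixed $s \to s'$ transition as a hypothetical dictionary from actions to values.

```python
from fractions import Fraction
from itertools import product
from math import prod

def normalize(f_tables):
    """Rescale non-negative per-player f-tables so every value lies in [0, 1].

    f_tables[p] maps each action of player p to f_p(action). The returned
    g_tables satisfy prod_p g_p(x_p) == prod_p f_p(x_p) for every joint action.
    """
    alphas = [max(t.values()) for t in f_tables]
    beta = prod(alphas)  # transition probability at the maximizing profile
    assert 0 < beta <= 1, "the product must be a valid probability"
    g_tables = [{a: v / alpha for a, v in t.items()}
                for t, alpha in zip(f_tables, alphas)]
    # Fold the constant beta into the first player's function.
    g_tables[0] = {a: beta * v for a, v in g_tables[0].items()}
    return g_tables

# Illustrative two-player tables; player 0 has a value above 1.
f = [{"a": Fraction(2), "b": Fraction(1)},
     {"c": Fraction(1, 4), "d": Fraction(1, 8)}]
g = normalize(f)
```

Here $\alpha = (2, 1/4)$ and $\beta = 1/2$, so player 0’s rescaled values lie in $[0, \beta]$ and player 1’s in $[0, 1]$, while every joint product is unchanged.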
Appendix B

The  Cone  Extension  of  a  Polyhedron 


Let $X = \{x \mid Ax = b,\ x \ge 0\}$ be a polyhedron. Then, the cone extension of $X$ is

$$X^c = \{(\alpha x, \alpha) \mid x \in X,\ \alpha \ge 0\}. \tag{B.1}$$

(B.2) 

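Thus, $(x^c, \alpha) \in X^c$ if and only if $(1/\alpha)x^c \in X$ and $\alpha > 0$. Then, writing $(x^c, \alpha)$ for the
(column) vector formed by appending $\alpha$ to the end of $x^c$, we have

$$\begin{aligned}
& (1/\alpha)x^c \in X \text{ and } \alpha > 0 \\
\iff\ & A(1/\alpha)x^c = b \text{ and } (x^c, \alpha) \ge 0 && \text{(B.3)} \\
\iff\ & Ax^c = \alpha b \text{ and } (x^c, \alpha) \ge 0 && \text{(B.4)} \\
\iff\ & [A, -b](x^c, \alpha) = 0 \text{ and } (x^c, \alpha) \ge 0, && \text{(B.5)}
\end{aligned}$$

where $[A, -b]$ is the matrix formed by adding $-b$ as a new final column to $A$. Thus,

$$X^c = \{(x^c, \alpha) \mid [A, -b](x^c, \alpha) = 0,\ (x^c, \alpha) \ge 0\}.$$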
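
As a concrete illustration, the following sketch tests membership in the cone extension using the condition $Ax = \alpha b$ with $(x, \alpha) \ge 0$, which is the content of the chain of equivalences above. The function name, the tolerance, and the example polyhedron (the one-dimensional simplex) are assumptions made for illustration.

```python
def in_cone_extension(A, b, x, alpha, tol=1e-9):
    """Check A x = alpha * b together with (x, alpha) >= 0."""
    if alpha < 0 or any(xi < 0 for xi in x):
        return False
    return all(
        abs(sum(a * xi for a, xi in zip(row, x)) - alpha * bi) <= tol
        for row, bi in zip(A, b)
    )

# X = {x | x_1 + x_2 = 1, x >= 0}; its cone extension scales points of X by alpha.
A = [[1.0, 1.0]]
b = [1.0]
inside = in_cone_extension(A, b, [0.5, 1.5], 2.0)    # 2 * (0.25, 0.75)
outside = in_cone_extension(A, b, [0.5, 0.5], 2.0)   # sums to 1, not 2
```

The first point is $\alpha = 2$ times a point of $X$ and so passes the check; the second fails because its coordinates sum to 1 rather than $\alpha$.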


Appendix C

Specification  of  a  Geometric  Experts 
Algorithm 


In this section we point out how the FPL algorithm and analysis of Kalai and Vempala
[2003] can be adapted to our setting to use as the GEX subroutine, and prove the
corresponding bound needed for Theorem 6.4.5. In particular, we need a bound for an
arbitrary $S$ and arbitrary cost vectors, requiring only that on each timestep, $|c \cdot x| \le R$.
Further, the bound must hold against an adaptive adversary.

FPL solves the online optimization problem when the entire cost vector $c^t$ is observed
at each timestep. It maintains the sum $c^{1:t-1}$, and on each timestep plays decision
$x^t = \mathcal{R}(c^{1:t-1} + \mu)$, where $\mu$ is chosen uniformly at random from $[0, 1/\epsilon]^n$, given $\epsilon$, a parameter
of the algorithm. The analysis of FPL in Kalai and Vempala [2003] assumes positive
cost vectors $c$ satisfying $\|c\|_1 \le A$, and positive decision vectors from $S \subseteq \mathbb{R}^n$ with
$\|x - y\|_1 \le D$ for all $x, y \in S$ and $|c \cdot x - c \cdot y| \le R$ for all cost vectors $c$ and
$x, y \in S$. Further, the bounds proved are with respect to a fixed series of cost vectors, not
an adaptive adversary. We now show how to bridge the gap from these assumptions to our
assumptions.
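
The FPL step described above can be sketched as follows. This is a minimal illustration rather than the implementation analyzed in the thesis: it assumes a finite decision set given as a list (so the oracle $\mathcal{R}$ is a brute-force minimum), and the function names and the seed are hypothetical.

```python
import random

def fpl_decision(S, cost_sum, eps, rng):
    """One FPL step: perturb the cumulative cost and play the best decision.

    S is a finite list of decision vectors (an assumption of this sketch;
    the text only requires a linear-optimization oracle over S).
    """
    n = len(cost_sum)
    mu = [rng.uniform(0.0, 1.0 / eps) for _ in range(n)]  # fresh noise each step
    perturbed = [c + m for c, m in zip(cost_sum, mu)]
    return min(S, key=lambda x: sum(p * xi for p, xi in zip(perturbed, x)))

def run_fpl(S, costs, eps, seed=0):
    """Play FPL against a fixed cost sequence and return the total loss."""
    rng = random.Random(seed)
    cost_sum = [0.0] * len(costs[0])
    total = 0.0
    for c in costs:
        x = fpl_decision(S, cost_sum, eps, rng)
        total += sum(ci * xi for ci, xi in zip(c, x))
        cost_sum = [s + ci for s, ci in zip(cost_sum, c)]  # full cost is observed
    return total

# Two decisions; the first always costs 1, the second always costs 0.
S = [(1.0, 0.0), (0.0, 1.0)]
loss = run_fpl(S, [(1.0, 0.0)] * 200, eps=1.0)
```

Drawing a new perturbation on every call matters for the adaptive-adversary argument later in this appendix, since it makes the distribution over decisions depend only on the cost history.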

First,  we  adapt  an  argument  from  Awerbuch  and  Kleinberg  [2004],  showing  that  by 
using  our  barycentric  spanner  basis,  we  can  transform  our  problem  into  one  where  the 
assumptions  of  FPL  are  met.  We  then  argue  that  a  corresponding  bound  holds  against  an 
adaptive  adversary. 


Lemma C.0.2. Let $S \subseteq \mathbb{R}^n$ be a set of (not necessarily positive) decisions, and
$k^T = [c^1, \ldots, c^T]$ a set of cost vectors on those decisions, such that $|c^t \cdot x| \le R$ for all $x \in S$
and $c^t \in k^T$. Then, there is an algorithm $\mathcal{A}(\epsilon)$ that achieves

$$E[\text{loss}(\mathcal{A}(\epsilon), k^T)] \le \text{OPT}(k^T) + \epsilon(4n + 2)R^2 T + \frac{4n}{\epsilon}.$$

Proof. This is an adaptation of the arguments of Appendix A of Awerbuch and Kleinberg
[2004]. Fix a barycentric spanner $B = \{b_1, \ldots, b_n\}$ for $S$. Then, for each $x \in S$, let
$x = Bw$ and define $f(x) = [-\sum_{i=1}^{n} w_i, w_1, \ldots, w_n]$. Let $f(S) = S'$. For each cost
vector $c^t$ define $g(c^t) = [R, R + c^t \cdot b_1, \ldots, R + c^t \cdot b_n]$. It is straightforward to
verify that $c^t \cdot x = g(c^t) \cdot f(x)$, and further $g(c^t) \ge 0$, $\|g(c^t)\|_1 \le (2n + 1)R$, and the
difference in cost of any two decisions against a fixed $g(c^t)$ is at most $2R$. By definition
of a barycentric spanner, $w_i \in [-1, 1]$ and so the $L_1$ diameter of $S'$ is at most $4n$. Note the
assumption of positive decision vectors in Theorem 1 of Kalai and Vempala [2003] can
easily be lifted by additively shifting the space of decision vectors until it is positive. This
changes the loss of the algorithm and of the best decision by the same amount, so additive
regret bounds are unchanged. The result of this lemma then follows from the bound of
Theorem 1 from Kalai and Vempala [2003]. □
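
The identity $c^t \cdot x = g(c^t) \cdot f(x)$ is easy to check numerically. The sketch below assumes the lifted maps $f(x) = [-\sum_i w_i, w_1, \ldots, w_n]$ and $g(c) = [R, R + c \cdot b_1, \ldots, R + c \cdot b_n]$ with $x = Bw$; the matrix, weights, and cost vector are arbitrary illustrative values and are not claimed to come from an actual barycentric spanner.

```python
def matvec(B, w):
    """Return B w, where B is given as a list of rows."""
    return [sum(b * wi for b, wi in zip(row, w)) for row in B]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def f(w):
    """Lifted decision f(x) = [-sum_i w_i, w_1, ..., w_n] for x = B w."""
    return [-sum(w)] + list(w)

def g(c, B, R):
    """Lifted cost g(c) = [R, R + c.b_1, ..., R + c.b_n]; b_i are columns of B."""
    n = len(B[0])
    cols = [[row[i] for row in B] for i in range(n)]
    return [R] + [R + dot(c, b) for b in cols]

# Illustrative data: columns b_1 = (1, 1), b_2 = (0, 2); |c . b_i| <= R = 1.
B = [[1.0, 0.0], [1.0, 2.0]]
w = [0.5, -1.0]
c = [0.3, -0.2]
R = 1.0
original = dot(c, matvec(B, w))     # c . x with x = B w
lifted = dot(g(c, B, R), f(w))      # g(c) . f(x)
```

The leading coordinate $-\sum_i w_i$ of $f(x)$ cancels the $R$ offsets added to make $g(c)$ non-negative, which is why the two inner products agree.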


We now need to extend the above bound to adaptive adversaries. The key point here
is that the algorithm is self-oblivious. A self-oblivious algorithm always plays a decision
from some distribution that depends only on the cost history so far and not the outcome
of its previous probabilistic choices. Thus, a self-oblivious algorithm can be viewed as a
function from cost histories to distributions over decisions. For such algorithms, for any
(possibly adaptive) adversary $\mathcal{V}$ there always exists an oblivious adversary that causes at
least as much regret. The idea for the proof below is due to Adam Kalai.¹


Lemma C.0.3. Fix $T$, let $H^*$ be the set of decision histories of length 0 to $T - 1$, and
let $K^*$ be the set of all cost histories of length 0 to $T - 1$. Then, fix a decision algorithm
$A : K^* \to \Delta(S)$, where $\Delta(S)$ is the set of probability distributions on the set $S$ of possible
decisions. Define

$$R(A, \mathcal{V}) = E_{A, \mathcal{V}}\left[\sum_{t=1}^{T} c^t x^t - \min_{x \in S} \sum_{t=1}^{T} c^t x\right].$$

Let $\mathcal{V}$ be an arbitrary adversary. Then, there exists an oblivious adversary $\mathcal{V}'$ such that

$$R(A, \mathcal{V}') \ge R(A, \mathcal{V}).$$


¹We thank Tom Hayes and Varsha Dani for pointing out a bug in the proof we had in the original version
of this paper.
Proof. An adversary is $t$-oblivious if its first $t$ costs are chosen obliviously; note all
adversaries are 1-oblivious. Let $\mathcal{V}$ be an arbitrary adversary, and suppose it is $k$-oblivious. If
$k = T$, we are done. Otherwise, let $c_o^1, \ldots, c_o^k$ be the first $k$ (obliviously chosen) costs
selected by $\mathcal{V}$. Expectations are over the random variables $x^1, \ldots, x^T$ and $c^1, \ldots, c^T$ when $\mathcal{V}$
plays against $A$, though in this case $c^1, \ldots, c^k$ are fully determined. Let $K^T = c^1, \ldots, c^T$,
the random vector corresponding to the cost history.

Let $g(K^T) = \min_{x \in S} \sum_{t=1}^{T} c^t x$. Using linearity of expectation, we can split the
expected regret $R(A, \mathcal{V})$ into 3 terms:

$$E\left[\sum_{t=1}^{k} c_o^t x^t\right] + E[c^{k+1} x^{k+1}] + E\left[\sum_{t=k+2}^{T} c^t x^t - g(K^T)\right].$$

Since $A$ and $c_o^1, \ldots, c_o^k$ are fixed, $E[x^{k+1}] = E[A(c_o^1, \ldots, c_o^k)] = \bar{x}$ is also known.
Since $\mathcal{V}$ is only $k$-oblivious, it gets to pick $c^{k+1}$ with knowledge of $x^1, \ldots, x^k$. We have

$$\Pr(c^{k+1}) = \sum_{x^1, \ldots, x^k} \Pr(x^1, \ldots, x^k)\, \mathbb{I}(x^1, \ldots, x^k; c^{k+1}),$$

where $\mathbb{I}$ is an indicator function, returning 1 if $\mathcal{V}(x^1, \ldots, x^k) = c^{k+1}$ and zero otherwise.
The probability $\Pr(x^1, \ldots, x^k)$ is well defined because $\mathcal{V}$ and $A$ are fixed. Importantly,
note that the distribution over $c^{k+1}$ is independent of the distribution over $x^{k+1}$; this
follows from the assumption that $A$ is self-oblivious, that is, it picks its distributions based
only on the past cost vectors, not on its own actions. Thus, letting $L^k = E[\sum_{t=1}^{k} c_o^t x^t]$, we
can write

$$\begin{aligned}
R(A, \mathcal{V}) &= L^k + \bar{x} E[c^{k+1}] + E\left[\sum_{t=k+2}^{T} c^t x^t - g(K^T)\right] && \text{(C.1)} \\
&= L^k + \sum_{c^{k+1}} \Pr(c^{k+1}) \left(\bar{x} c^{k+1} + E\left[\sum_{t=k+2}^{T} c^t x^t - g(K^T) \,\middle|\, c^{k+1}\right]\right) && \text{(C.2)} \\
&\le L^k + \sup_{c^{k+1}} \left(\bar{x} c^{k+1} + E\left[\sum_{t=k+2}^{T} c^t x^t - g(K^T) \,\middle|\, c^{k+1}\right]\right), && \text{(C.3)}
\end{aligned}$$

where the sup is over all $c^{k+1}$ with $\Pr(c^{k+1}) > 0$. Observe that the quantity inside the
supremum is well defined before any costs or decisions are selected, and so $\mathcal{V}$ could do at
least as well by selecting $c^{k+1}$ obliviously to be some $c$ that achieves the supremum. Thus,
there is a $(k + 1)$-oblivious adversary that causes at least as much regret as $\mathcal{V}$. Extending
this result inductively, we conclude there is a fully oblivious ($T$-oblivious) adversary $\mathcal{V}'$
such that $R(A, \mathcal{V}') \ge R(A, \mathcal{V})$. □
Lemma C.0.4. The regret bound from Lemma C.0.2 applies even if the adversary is
adaptive.

Proof. First, observe that as long as FPL re-randomizes at each timestep, it is
self-oblivious, and so Lemma C.0.3 applies. Suppose some adaptive adversary $\mathcal{V}$ causes regret
that exceeds the bound in Lemma C.0.2. We can apply Lemma C.0.3 to $\mathcal{V}$ and construct
an oblivious $\mathcal{V}'$ that also exceeds the bound, a contradiction. □

Thus, we can use $\mathcal{A}(\epsilon)$ as our GEX subroutine for full-observation online geometric
optimization.




Appendix  D 
Notions  of  Regret 


In Auer et al. [1995], an alternative definition of regret is given, namely,

$$E[\text{loss}_{\mathcal{V}}] - \min_{x \in S} E\left[\sum_{t=1}^{T} c^t \cdot x\right]. \tag{D.1}$$

This definition is equivalent to ours in the case of an oblivious adversary, but against an
adaptive adversary the “best decision” for this definition is not the best decision for a
particular decision history, but the best decision if the decision must be chosen before a
cost history is selected according to the distribution over such histories. In particular,

$$E\left[\min_{x \in S} \sum_{t=1}^{T} c^t \cdot x\right] \le \min_{x \in S} E\left[\sum_{t=1}^{T} c^t \cdot x\right],$$

and so a bound on Equation (6.1) is at least as strong as a bound on Equation (D.1). In
fact, bounds on Equation (D.1) can be very poor when the adversary is adaptive. There are
natural examples where the stronger definition (6.1) gives regret $O(T)$ while the weaker
definition (D.1) indicates no regret. Adapting an example from Auer et al. [1995], let
$S = \{e_1, \ldots, e_n\}$ (the “flat” bandit setting) and consider the algorithm $A$ that plays uniformly
at random from $S$. The adversary $\mathcal{V}$ gives $c^1 = 0$, and if $A$ then plays $e_i$ on the first
iteration, thereafter the adversary plays the cost vector $c^t$ where $c_i^t = 0$ and $c_j^t = 1$ for $j \ne i$.
The expected loss of $A$ is $\frac{n-1}{n}(T - 1)$. For regret as defined by Equation (D.1),
$\min_{x \in S} E[c^{1:T} \cdot x] = \frac{n-1}{n}(T - 1)$, indicating no regret, while $E[\min_{x \in S}(c^{1:T} \cdot x)] = 0$, and so
the stronger definition indicates $O(T)$ regret.
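
The gap between the two definitions in this example can be computed exactly. The sketch below assumes the setup just described (uniform play over $e_1, \ldots, e_n$, a zero cost vector on the first iteration, and thereafter cost 0 on the first decision played and 1 elsewhere); the function and variable names are illustrative.

```python
from fractions import Fraction

def regret_notions(n, T):
    """Exact strong and weak expected regret for the uniform-play example."""
    p = Fraction(1, n)
    exp_loss = Fraction(0)
    exp_min = Fraction(0)             # E[min_x c^{1:T} . x]
    exp_cost = [Fraction(0)] * n      # E[c^{1:T} . e_j] for each fixed j
    for i in range(n):                # decision played on the first iteration
        total = [0 if j == i else T - 1 for j in range(n)]   # c^{1:T}
        # A keeps playing uniformly, so its expected loss is the mean entry.
        exp_loss += p * Fraction(sum(total), n)
        exp_min += p * min(total)
        exp_cost = [e + p * c for e, c in zip(exp_cost, total)]
    strong = exp_loss - exp_min       # regret in the sense of Equation (6.1)
    weak = exp_loss - min(exp_cost)   # regret in the sense of Equation (D.1)
    return strong, weak

strong, weak = regret_notions(n=4, T=101)
```

For $n = 4$ and $T = 101$ the expected loss is $\frac{3}{4} \cdot 100 = 75$; the stronger notion reports regret 75 while the weaker notion reports 0.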

Unfortunately, this implies the proof techniques for bounds on expected weak regret
like those in Auer et al. [2002] and Awerbuch and Kleinberg [2004] cannot be used to get
bounds on regret as defined by Equation (6.1). The problem is that even if we have
unbiased estimates of the costs, these cannot be used to evaluate the term
$E[\min_{x \in S}(c^{1:T} \cdot x)]$ in (6.1) because min is a non-linear operator. We surmount this problem by proving
high-probability bounds on our estimates of $c^t$, which allows us to use a union bound
to evaluate the expectation over the min operator. Note that the high probability bounds
proved in Auer et al. [2002] and Awerbuch and Kleinberg [2004] can be seen as
corresponding to our definition of expected regret.